Department of Computer Science

EXTRACTION AND TRANSFORMATION OF DATA FROM SEMI-STRUCTURED TEXT FILES USING A DECLARATIVE APPROACH

by

RICARDO FORTUNA RAMINHOS

Thesis submitted to Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa, in partial fulfilment of the requirements for the degree of Master in Computer Science

Supervisor: João Moura Pires, PhD

Monte de Caparica, June 2007

Sumário

A problemática do ETL – Extracção (Extraction), Transformação (Transformation) e Carregamento (Loading) está a tornar-se progressivamente menos específica do domínio tradicional do data-warehousing, sendo estendida para o processamento de dados sob a forma textual. A Internet surge como uma grande fonte de informação textual, seguindo um formato semi-estruturado e facilmente compreensível, referindo-se a múltiplos domínios, alguns dos quais altamente complexos. Abordagens tradicionais ao ETL através do desenvolvimento específico de código fonte para cada repositório de dados e baseadas em múltiplas interacções entre peritos do domínio e da informática tornam-se soluções inadequadas, propícias a demorarem longos períodos de tempo e fáceis de incorrer em erro. Uma nova abordagem ao ETL é proposta, baseada na sua decomposição em duas fases: ETD (Extracção, Transformação e Entrega de Dados) seguida de IL (Integração e Carregamento). A proposta ETD é suportada por uma linguagem declarativa para a representação das expressões ETD e uma aplicação gráfica para interacção com o perito de domínio. Aquando da fase ETD é necessário sobretudo conhecimento de domínio, enquanto que o conhecimento informático será centrado na fase IL, atribuindo os dados processados às aplicações de destino, permitindo uma separação clara dos vários tipos de conhecimento existentes. Seguindo a abordagem ETD+IL, a arquitectura para um Módulo de Processamento de Dados (DPM) é proposta, oferecendo uma solução completa para o processamento de dados desde o momento em que um ficheiro é adquirido (também através de uma abordagem declarativa) até à entrega dos dados. Um conjunto de ferramentas gráficas também é proposto, permitindo a monitorização, controlo e rastreio dos dados nos vários passos da solução. A abordagem ETD+IL foi implementada, integrada, testada e validada no contexto de uma solução de processamento de dados para um sistema do domínio espacial, actualmente operacional na Agência Espacial Europeia para a missão Galileo.


Abstract

The Extraction, Transformation and Loading (ETL) problematic is becoming progressively less specific to the traditional data-warehousing domain and is being extended to the processing of textual data. The World Wide Web (WWW) appears as a major source of textual information, following a human-readable semi-structured format and referring to multiple domains, some of them highly complex. Traditional ETL approaches, based on the development of specific source code for each data source and on multiple interactions between domain and computer-science experts, become an inadequate solution: time consuming and prone to error. A novel approach to ETL is proposed, based on its decomposition into two phases: ETD (Extraction, Transformation and Data Delivery) followed by IL (Integration and Loading). The ETD proposal is supported by a declarative language for expressing ETD statements and a graphical application for interacting with the domain expert. When applying ETD, mainly domain expertise is required, while computer-science expertise is centred in the IL phase, linking the processed data to target system models and enabling a clearer separation of concerns. Following the ETD+IL approach, a declarative Data Processing Module (DPM) architecture is proposed that offers a complete data processing solution, from file download (also using a declarative approach) to data delivery. A set of graphical tools is also proposed that enables the monitoring, control and traceability of data through the whole data processing solution. The ETD+IL approach has been implemented, integrated, tested and validated in a full data processing solution for a space domain system, currently operational at the European Space Agency for the Galileo Mission.


Dedicado ao principal entusiasta da minha tese… O meu Avô


Agradecimentos

Para a Lisa, pelo teu constante apoio e fantástica paciência durante os dois anos que esta tese demorou a concluir. Prometo não iniciar o doutoramento nos próximos tempos...

Para os meus pais e irmã – Abílio, Clara e Raquel – pelo vosso apoio, educação e constante presença na minha formação enquanto ser humano.

Para o meu amigo e orientador de tese Professor João Moura-Pires, que muito me ensinou enquanto engenheiro e sobretudo enquanto pessoa.

Para o meu amigo e colega Ricardo Ferreira, pela sua competência técnica e postura sempre descontraída durante o nosso esforço conjunto no desenvolvimento do sistema SESS.

Para o meu amigo e colega Nuno Viana, que me introduziu no fantástico mundo do ETL, tendo ainda paciência para rever esta tese.

Para o grupo de investigação CA3 – a minha primeira experiência profissional – onde fui bem acolhido por todos os seus investigadores e com os quais aprendi bastante. Um agradecimento especial à professora Rita Ribeiro, por confiar e apoiar o meu trabalho nos muitos projectos em que participei no CA3: CESADS, EO-KES, SEIS, MODI e SESS.

Para os colegas do gabinete 236 – André Marques e Sérgio Agostinho – que me ajudaram a manter a sanidade mental nos últimos meses da tese. Espero que concluam as vossas teses em breve.

A todos vós o meu muito obrigado!


It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.

A Scandal in Bohemia, Sir Arthur Conan Doyle


Acronyms

Table 1.1: List of acronyms

3GL - Third Generation Language
3M - Multi Mission Module
API - Application Programmers' Interface
ASCII - American Standard Code for Information Interchange
BI - Business Intelligence
CA3 - Soft-Computing and Autonomous Agents (Research Group)
CASE - Computer-Aided Software Engineering
CDC - Change Data Capture
CDI - Customer Data Integration
CESADS - Centralised ESTRACK Status and Diagnostic of Intelligent Systems
COBOL - Common Business Oriented Language
COTS - Commercial Off-The-Shelf
CPU - Central Processing Unit
CRC - Cyclic Redundancy Check
CSV - Comma Separated Values
CVS - Concurrent Version System
CWM - Common Warehouse Metamodel
DBMS - Database Management System
DDMS - Distributed Database Management System
DIM - Data Integration Module
DM - Data Mart
DPM - Data Processing Module
DSP - Data Service Provider
DTD - Document Type Definition
DW - Data Warehouse
EAI - Enterprise Application Integration
ECM - Enterprise Content Management
EDR - Enterprise Data Replication
EID - External Identifier
EII - Enterprise Information Integration
ELT - Extraction, Loading and Transformation
Envisat - Environmental Satellite
EO-KES - Earth Observation domain specific Knowledge Enabled Services
ESA - European Space Agency
ESOC - European Space Operations Centre
ESTRACK - European Space Tracking Network
ETD - Extraction, Transformation and Data Delivery
ETL - Extraction, Transformation and Loading
ETLT - Extraction, Transformation, Loading and Transformation
FCT (1) - "Faculdade de Ciências e Tecnologia"
FCT (2) - Flight Control Team
FET - File Extractor and Transformer
FFD - File Format Definition
FFDE - File Format Definition Editor
FOP - Flight Operations Plan
FR - File Retriever
FTP - File Transfer Protocol
GIOVE - Galileo In-Orbit Validation Element
GLONASS - Global Navigation Satellite System
GOES - Geostationary Operations Environmental Satellite
GPC - Ground Program Control
GPS - Global Positioning System
GUI - Graphical User Interface
HMI - Human Machine Interface
HTML - Hyper Text Mark-up Language
HTTP - Hyper Text Transfer Protocol
I/O - Input / Output
IIS - Internet Information System
IL - Integration and Loading
INTEGRAL - International Gamma-Ray Astrophysics Laboratory (satellite)
IT - Information Technology
J2EE - Java 2 Enterprise Edition
JDBC - Java DataBase Connection
JDK - Java SE Development Toolkit
JMS - Java Message Service
JSP - Java Server Pages
JVM - Java Virtual Machine
KB - Kilo Byte
MB - Mega Byte
MEO - Medium Earth Orbit
MODI - Simulation of Knowledge Enabled Monitoring and Diagnosis Tool for Mars Lander Payloads (Monitoring and Diagnosis for Mars Driller)
MOF - Meta Object Facility
MOLAP - Multidimensional On-Line Analytical Processing
MR - Metadata Repository
MS - Microsoft
MT - Monitoring Tool
NOAA / NGDC - National Oceanic & Atmospheric Administration / National Geophysical Data Centre
NOAA / SEC - National Oceanic and Atmospheric Administration / Space Environment Centre
ODBC - Open DataBase Connectivity
ODS - Operational Data Storage
OLAP - On-Line Analytical Processing
OMG - Object Management Group
PCRE - Perl Compatible Regular Expressions
PDF - Portable Document Format
PF - Provided File
PHA - Potentially Hazardous Asteroids
PHP - PHP: Hypertext Preprocessor
PS - Post Script
RAT - Report and Analysis Tool
RDBMS - Relational Database Management Systems
RGB - Red / Green / Blue
RSS - Really Simple Syndication
RTF - Rich Text Format
S/C - Spacecraft
S/W - Space Weather
SADL - Simple Activity Definition Language
SE (1) - Space Effects
SE (2) - Second Edition
SEIS - Space Environment Information System for Mission Control Purposes
SESS - Space Environment Support System for Telecom and Navigation Systems
SEU - Single Event Upset
SGML - Standard Generalized Markup Language
SIDC - Sunspot Index Data Centre
SMS - Short / Simple Message Service
SOAP - Simple Object Access Protocol
SOHO - Solar and Heliospheric Observatory (satellite)
SPENVIS - Space Environment Information System
SQL - Structured Query Language
SREM - Standard Radiation Monitor
SSIS - SQL Server Integration Services
TM - Telemetry
UDAP - Uniform Data Access Proxy
UDET - Uniform Data Extraction and Transformer
UDOB - Uniform Data Output Buffer
UML - Uniform Modelling Language
UNL - "Universidade Nova de Lisboa"
URL - Uniform Resource Locator
W3C - World Wide Web Consortium
WSDL - Web Service Definition Language
WWW - World Wide Web
XADL - XML-based Activity Definition Language
XMI - XML Metadata Interchange
XML - eXtensible Markup Language
XMM - X-Ray Multi-Mission (satellite)
XPath - XML Path Language
XQuery - XML Query Language
XSL - Extensible Stylesheet Language
XSLT - XSL Transformations

Index

CHAPTER 1  INTRODUCTION ... 1
1.1  SEMI-STRUCTURED TEXT DATA ... 2
1.2  ETL APPLIED TO SEMI-STRUCTURED TEXT DATA ... 4
1.3  SEIS AND SESS SYSTEMS ... 4
1.4  THESIS OVERVIEW ... 5
  1.4.1  Goals ... 5
  1.4.2  Contributions ... 6
1.5  THESIS STRUCTURE ... 7
1.6  CONVENTIONS ... 9
  1.6.1  Textual Notations ... 10
  1.6.2  Uniform Modelling Language ... 10
CHAPTER 2  RELATED WORK ... 11
2.1  THE CORRECT ETL TOOL ... 12
2.2  ETL CONCEPTUAL REPRESENTATION AND FRAMEWORK ... 13
  2.2.1  AJAX ... 13
  2.2.2  Meta Object Facility ... 15
  2.2.3  Common Warehouse Metamodel ... 16
2.3  CLASSICAL DATA INTEGRATION ARCHITECTURES ... 17
  2.3.1  Hand Coding ... 17
  2.3.2  Code Generators ... 17
  2.3.3  Database Embedded ETL ... 18
  2.3.4  Metadata Driven ETL Engines ... 18
2.4  APPROACHES TO DATA PROCESSING ... 19
  2.4.1  Data Consolidation ... 19
  2.4.2  Data Federation ... 21
  2.4.3  Data Propagation ... 22
  2.4.4  Hybrid Approach ... 22
  2.4.5  Change Data Capture ... 23
  2.4.6  Data Integration Technologies ... 23
2.5  METADATA FOR DESCRIBING ETL STATEMENTS ... 28
2.6  ETL MARKET ANALYSIS ... 31
  2.6.1  METAspectrum Market Summary ... 31
  2.6.2  Gartner Market Summary ... 33
2.7  ETL – STATE OF THE ART REPORT ... 34
2.8  SPACE ENVIRONMENT INFORMATION SYSTEM FOR MISSION CONTROL PURPOSES ... 36
  2.8.1  Objectives ... 37
  2.8.2  Architecture ... 37
  2.8.3  Data Processing Module ... 38
  2.8.4  Evaluation ... 39
2.9  CONCLUSIONS ... 41
CHAPTER 3  DECOMPOSING ETL: THE ETD + IL APPROACH ... 51
3.1  CLASSICAL ETL SOLUTIONS ... 52
3.2  THESIS: THE ETD+IL APPROACH ... 54
3.3  REQUIREMENTS FOR ETD ... 55
  3.3.1  Free, Open Source and Independent ... 55
  3.3.2  Completeness ... 56
  3.3.3  Separation of Concerns ... 56
  3.3.4  User Friendliness ... 57
  3.3.5  Performance ... 57
  3.3.6  Scalability ... 58
  3.3.7  Modularity ... 58
  3.3.8  Reusability ... 59
  3.3.9  Metadata Driven ... 59
  3.3.10  Correctness ... 60
  3.3.11  Validation ... 60
  3.3.12  Data Traceability ... 61
  3.3.13  Fault Tolerance ... 61
CHAPTER 4  DATA PROCESSING MODULE ... 63
4.1  TECHNOLOGIES ... 64
  4.1.1  XML ... 64
  4.1.2  XML Schema ... 65
  4.1.3  XPath ... 65
  4.1.4  XSLT ... 65
  4.1.5  XQuery ... 66
  4.1.6  Apache Tomcat HTTP Server ... 66
  4.1.7  SOAP ... 66
  4.1.8  Web Services ... 66
  4.1.9  Java ... 66
  4.1.10  Regular Expressions ... 67
  4.1.11  Applying the Technologies ... 67
4.2  DATA PROCESSING MODULE ARCHITECTURE ... 68
  4.2.1  Scalability ... 70
  4.2.2  Load Balancing ... 72
4.3  FILE RETRIEVER ENGINE ... 73
  4.3.1  Main Metadata Concepts ... 77
4.4  ETD ENGINE ... 83
4.5  DATA DELIVERY INTERFACE ... 86
4.6  FFD EDITOR ... 88
4.7  DPM CONSOLE ... 89
4.8  LOG ANALYSER ... 91
4.9  SUMMARY ... 92
CHAPTER 5  THE FILE FORMAT DEFINITION LANGUAGE AND EDITOR ... 93
5.1  THE FILE FORMAT DEFINITION LANGUAGE ... 94
  5.1.1  Model ... 95
  5.1.2  XML Schema Implementation ... 98
  5.1.3  Transformation Plugins ... 107
5.2  THE FFD EDITOR ... 108
  5.2.1  Menu Functionalities and General Metadata ... 112
  5.2.2  Extraction ... 114
  5.2.3  Transformation ... 126
  5.2.4  Data Delivery ... 128
  5.2.5  Validation ... 130
  5.2.6  Recovery and Debug ... 131
5.3  LANGUAGE EXPRESSIVENESS AND EXTENSIBILITY ... 132
CHAPTER 6  CASE STUDIES ... 133
6.1  VERSATILITY FOR MULTIPLE DOMAINS ... 134
  6.1.1  Stock Trading Domain ... 134
  6.1.2  Banking Domain ... 135
  6.1.3  Geological Domain ... 136
  6.1.4  Physical and Spatial Domains ... 137
6.2  SPACE ENVIRONMENT SUPPORT SYSTEM FOR TELECOM / NAVIGATION MISSIONS ... 139
  6.2.1  Galileo Mission ... 139
  6.2.2  Objectives ... 141
  6.2.3  General Architecture ... 141
  6.2.4  ETD+IL Integration and Usage in SESS ... 143
CHAPTER 7  EVALUATION AND CONCLUSIONS ... 149
7.1  CONCEPTUAL APPROACH EVALUATION ... 150
7.2  REQUIREMENTS FOR ETD ... 151
  7.2.1  Free, Open Source and Independent ... 151
  7.2.2  Completeness ... 151
  7.2.3  Separation of Concerns ... 152
  7.2.4  User Friendliness ... 152
  7.2.5  Performance ... 152
  7.2.6  Scalability ... 153
  7.2.7  Modularity ... 153
  7.2.8  Reusability ... 153
  7.2.9  Metadata Driven ... 154
  7.2.10  Correctness ... 154
  7.2.11  Validation ... 155
  7.2.12  Data Traceability ... 155
  7.2.13  Fault Tolerance ... 155
7.3  CONCLUSIONS ... 156
7.4  FUTURE WORK ... 156
CHAPTER 8  REFERENCES ... 159
8.1  REFERENCES ... 160
ANNEXES
SESS DATA PROCESSING MODULE REQUIREMENTS ... 163
AVAILABLE TRANSFORMATION OPERATIONS ... 166
REGULAR EXPRESSION LIBRARY (XML INSTANCE) ... 167
SESS FILE FORMAT DEFINITION STATISTICS ... 168

Index of Figures

Figure 1.1: Part of a text file example containing exchange rates data [3] ... 2
Figure 1.2: Part of a text file example of solar activity events [2] ... 3
Figure 1.3: A text file example of flare, magnetic and proton forecasts [2] ... 3
Figure 2.1: A two-level framework (example for a library) ... 14
Figure 2.2: MOF layers using UML and Java as comparison ... 15
Figure 2.3: Common Warehouse Metamodel structure [24] ... 16
Figure 2.4: Data consolidation, federation and propagation ... 19
Figure 2.5: Push and pull modes of data consolidations ... 20
Figure 2.6: Part of the scenario expressed with XADL ... 29
Figure 2.7: The syntax of SADL ... 30
Figure 2.8: Part of the scenario expressed with SADL ... 30
Figure 2.9: METAspectrum evaluation [33] ... 32
Figure 2.10: Magic quadrant [34] for extraction, transformation and loading ... 34
Figure 2.11: SEIS system architecture modular breakdown ... 38
Figure 2.12: An abstract ETL architecture ... 41
Figure 2.13: Group1 Data Flow architecture [46-48] ... 42
Figure 2.14: Sybase Transform On Demand architecture [49-51] ... 42
Figure 2.15: Sunopsis architecture [52-55] ... 43
Figure 2.16: DB Software Laboratory's Visual Importer Mapping [56, 57] ... 44
Figure 2.17: iWay Data Migrator [58-61] ... 44
Figure 2.18: Informatica [28, 62] workflow example ... 45
Figure 2.19: SAS ETL Studio [63-65] workflow example ... 45
Figure 2.20: Business Objects Data Integration (data patterns detection) [66] ... 46
Figure 2.21: Business Objects Data Integration (multi-user collaboration) [66] ... 46
Figure 2.22: Business Objects Data Integration (impact analysis) [66] ... 47
Figure 2.23: DB Software Laboratory's Visual Importer (scheduler) [56, 57] ... 47
Figure 2.24: Sybase TransformOnDemand (scheduler) [49-51] ... 48
Figure 2.25: Informatica (management grid console) [28, 62] ... 48
Figure 2.26: DB Software Laboratory's Visual Importer (text wizard) [56, 57] ... 49
Figure 2.27: Sybase TransformOnDemand (text data provider wizard) [49-51] ... 49
Figure 2.28: SAS ETL (text wizard) [63-65] ... 50
Figure 3.1: Abstract architecture of a data warehouse ... 52
Figure 3.2: ETL classical pipeline ... 53
Figure 3.3: ETD + IL pipeline ... 54
Figure 3.4: IL pipeline ... 55
Figure 4.1: Technology knowledge model ... 68
Figure 4.2: Data Processing Module architecture ... 69
Figure 4.3: FR versus ETD architecture ... 70
Figure 4.4: ETD versus Data Delivery architecture ... 71
Figure 4.5: Load balancing architecture ... 72
Figure 4.6: FR Engine actions after being launched ... 74
Figure 4.7: Scheduler actions ... 75
Figure 4.8: Data Service Provider Dispatcher actions ... 76
Figure 4.9: Data Service Provider schema ... 77
Figure 4.10: Connection element ... 78
Figure 4.11: Provided File schema ... 79
Figure 4.12: Source / File element (up) and Source / Database element (down) ... 80
Figure 4.13: WebMethod element ... 80
Figure 4.14: Example of a FTP Query ... 81
Figure 4.15: Example of a Binary Program Query ... 81
Figure 4.16: Example of a Database Query ... 81
Figure 4.17: Example of a Web Service Query ... 82
Figure 4.18: ScheduleOptions element ... 82
Figure 4.19: Routing element ... 83
Figure 4.20: ETD Engine input / output data flow ... 83
Figure 4.21: Data Delivery package size ... 84
Figure 4.22: ETD Engine tasks pipeline ... 85
Figure 4.23: Generic Data Delivery schema ... 87
Figure 4.24: Example of Data element contents ... 88
Figure 4.25: DPM HMI interaction with FR and ETD engines ... 89
Figure 4.26: DPM HMI logging subscription mechanism ... 89
Figure 4.27: DPM Console - Data Service Provider metadata ... 90
Figure 4.28: DPM Console – logging area ... 91
Figure 4.29: Toolbars and filtering area ... 92
Figure 5.1: The File Format Definition model ... 95
Figure 5.2: General assumptions ... 96
Figure 5.3: Section definition ... 97
Figure 5.4: Sectioning algorithm ... 97
Figure 5.5: The FileFormatDefinition root element ... 99
Figure 5.6: The Delimited element ... 100
Figure 5.7: The SingleValue element ... 101
Figure 5.8: The Table element ... 102
Figure 5.9: The Validation element ... 103
Figure 5.10: The Transformation element ... 104
Figure 5.11: The DataDelivery element ... 104
Figure 5.12: The Template element ... 105
Figure 5.13: The Identifier element ... 105
Figure 5.14: The MappingIdentifier element ... 106
Figure 5.15: The GraphicalDisplay element ... 107
Figure 5.16: Transformation definition ... 108
Figure 5.17: The FFD Editor ETD tabs ... 109
Figure 5.18: A graphical layout ... 109
Figure 5.19: Extract tab layout ... 110
Figure 5.20: Transform tab layout ... 111
Figure 5.21: Data Delivery tab layout ... 112
Figure 5.22: FFD File menu ... 112
Figure 5.23: Open FFD… options ... 113
Figure 5.24: File Format Definition Metadata form ... 114
Figure 5.25: Marking a text area for section creation ... 115
Figure 5.26: Default section creation ... 116
Figure 5.27: Interacting with section delimiters ... 117
Figure 5.28: Transforming a line oriented delimiter into relative ... 117
Figure 5.29: A relative to previous section start delimiter ... 117
Figure 5.30: Transforming a relative delimiter into line oriented ... 117
Figure 5.31: Defining a section Start Delimiter based on a string pattern ... 118
Figure 5.32: Defining a Contiguous Section based on a string pattern ... 118
Figure 5.33: Pattern Definition form ... 118
Figure 5.34: Applying a text pattern (left) or advanced regular expression (right) to an input file text ... 119
Figure 5.35: Contiguous Section after applying a simple text pattern ... 119
Figure 5.36: Data types and validation rules (single value wizard) ... 120
Figure 5.37: Single Value Wizard – Defining the single value type ... 121
Figure 5.38: Single Value Wizard – Prefix based (left) or prefix and suffix based (right) single value ... 121
Figure 5.39: Single value representation ... 122
Figure 5.40: Table Wizard – Defining the table type ... 122
Figure 5.41: Defining a character delimited table ... 123
Figure 5.42: Table Wizard – Defining a fix width table with 3 columns (left) and defining a regular expression table (right) ... 123
Figure 5.43: Selecting a Table node after creating a table field ... 124
Figure 5.44: Selecting a Column Table ... 124
Figure 5.45: Mapping By Content sectioning definitions to regular expressions ... 125
Figure 5.46: Mapping a single value definition with PREFIX and SUFFIX to a regular expression ... 125
Figure 5.47: Mapping a character delimited table to a regular expression ... 125
Figure 5.48: Mapping a fixed width table to a regular expression ... 125
Figure 5.49: Regular Expression Library concept ... 126
Figure 5.50: Regular Expression Library (contiguous section wizard) ... 126
Figure 5.51: FFD Editor's Transformation step ... 127
Figure 5.52: FFD Editor's Data Delivery step ... 128
Figure 5.53: Identifier picker form ... 129
Figure 5.54: An empty template (SW parameters) with a parameter identifier defined ... 130
Figure 5.55: Comparing the sample file with the test set files ... 131
Figure 6.1: Part of text file example containing stock information [77] ... 135
Figure 6.2: Part of text file example containing exchange rates [3] ... 135
Figure 6.3: Part of text file containing earthquakes occurrences data [78] ... 136
Figure 6.4: Part of text file containing volcano daily alerts [79] ... 136
Figure 6.5: Part of a text file example of solar activity events [2] ... 137
Figure 6.6: A text file example of flare, magnetic and proton forecasts [2] ... 137
Figure 6.7: Part of a text file example of potentially hazardous asteroids [80] ... 138
Figure 6.8: Galileo cluster (artist's impression) ... 139
Figure 6.9: Galileo spacecraft prototypes in orbit (artist's impression): GIOVE-A (left side) and GIOVE-B (right side) ... 140
Figure 6.10: SESS Common and Mission infrastructure interaction (Mission perspective) ... 141
Figure 6.11: Generic SESS infrastructure ... 142
Figure 6.12: Section frequency per input file ... 145
Figure 6.13: Field frequency per input file ... 145
Figure 6.14: Transformation frequency per input file ... 146
Figure 6.15: Frequency of identifiers per input file ... 146

Index of Tables

Table 1.1: List of acronyms ... XIII
Table 4.1: Data Delivery format for a single EID ... 86
Table 4.2: Data Delivery format for multiple EID (mapping) ... 86
Table 5.1: Thread priority definition example ... 106
Table 6.1: SESS Data Service Providers and associated Provided Files ... 144
Table 8.1: DPM global requirements ... 163
Table 8.2: ETD Engine requirements ... 163
Table 8.3: FFD Editor requirements ... 164
Table 8.4: FR requirements ... 164
Table 8.5: Available transformation operations ... 166
Table 8.6: SESS File Format Definition statistics ... 168

Chapter 1

Introduction

This chapter introduces the ETL problematic for dealing with semi-structured text files and provides a first motivation for a novel approach to ETL that separates domain expertise from computer-science expertise. The SEIS and SESS space domain systems are presented, where the author carried out his first activities in the ETL domain and where the novel approach to ETL has been developed and validated. The thesis is briefly described, focusing on its goals / requirements and the author's expected contributions to the ETL community. Finally, the report's structure is presented, as well as the conventions used.


ETL stands for Extraction, Transformation and Loading of data from a data source to a normalized data target, and is usually applied in the data warehousing / integration domains [1]. In the recent past the ETL problematic has focused mainly on database sources, but a second kind of data source – textual data – is now emerging, becoming a relevant data source in its own right instead of a mere supporting one. This change has been mainly motivated by the continuous evolution of the World Wide Web (WWW), a major repository of textual information that is organized and presented in such a way that it is human readable and easily understood. Scientific data is an important subset of textual data, organized in a semi-structured format (e.g. tabular format). Space environment data¹ is one example of such data, available on multiple Hyper Text Transfer Protocol (HTTP) and File Transfer Protocol (FTP) servers such as those of the National Oceanic & Atmospheric Administration / National Geophysical Data Centre (NOAA/NGDC) [2]. Currently, no ETL solution exists that a domain expert without computer-science expertise can use in an intuitive way to manage the automatic retrieval and extraction of data from text files.

1.1 Semi-Structured Text Data

The World Wide Web appears as a major source of textual data, referring to multiple domains, some of them highly complex, ranging from stock market values (Figure 1.1) to solar activity (Figure 1.2) and physical measures (Figure 1.3).

1.1 Semi-Structured Text Data The World Wide Web appears as a major source of textual data, referring to multiple domains, some of them highly complex, ranging from stock market values (Figure 1.1) to solar activity (Figure 1.2) and physical measures (Figure 1.3). 1999-1-4;1.91;-;1.8004;1.6168;0.58231;35.107;7.4501;15.6466;0.7111;327.15;-;251.48;-;133.73;-;-;-;;8.855;2.2229;4.0712;-;9.4696;-;189.045;-;-;1.1789;6.9358;1.4238;110.265;9.4067;244.383;1.19679;-;-;-;-;-;;-; 1999-1-5;1.8944;-;1.7965;1.6123;0.5823;34.917;7.4495;15.6466;0.7122;324.7;-;250.8;-;130.96;-;-;-;;8.7745;2.2011;4.0245;-;9.4025;-;188.775;-;-;1.179;6.7975;1.4242;110.265;9.4077;242.809;1.20125;-;-;-;-;-;;-; 1999-1-6;1.882;-;1.7711;1.6116;0.582;34.85;7.4452;15.6466;0.7076;324.72;-;250.67;-;131.42;-;-;-;;8.7335;2.189;4.0065;-;9.305;-;188.7;-;-;1.1743;6.7307;1.4204;110.265;9.3712;244.258;1.20388;-;-;-;-;-;-;-; 1999-1-7;1.8474;-;1.7602;1.6165;0.58187;34.886;7.4431;15.6466;0.70585;324.4;-;250.09;-;129.43;-;-;-;;8.6295;2.1531;4.0165;-;9.18;-;188.8;-;-;1.1632;6.8283;1.4074;110.265;9.2831;247.089;1.21273;-;-;-;-;-;-;-;

Figure 1.1: Part of a text file example containing exchange rates data [3]

These files, containing textual data, follow a human-readable semi-structured format. The term semi-structured refers to the capability to organize and present information, highlighting the different types of data available in a file, e.g. a descriptive metadata area, an informative header, a disclaimer, remarks associated with the data area, numeric values or final remarks.

¹ Space environment data is introduced in this chapter since the thesis background and the main case study (where the proposed approach has been applied) refer to systems from the space domain.

Commonly, text files are made available by diverse Data Service Providers (DSP), external organizations that follow their own internal priorities, funding allocation and even individual goodwill. This results in non-normalized file formats that do not obey any standard besides a possible local one used by each individual provider. Comparing the file structures of Figure 1.1 and Figure 1.2, one can see that while Figure 1.2 has well delimited areas with text file metadata (prefixed by the ":" character), a description of the file contents (prefixed by the "#" character) and the data contents (the remainder of the file), the file structure depicted in Figure 1.1 does not follow any of these syntactic rules.

:Product: 20050427events.txt
:Created: 2005 Apr 28 0302 UT
:Date: 2005 04 27
# Prepared by the U.S. Dept. of Commerce, NOAA, Space Environment Center.
# Please send comments and suggestions to [email protected]
#
# Missing data: ////
# Updated every 30 minutes.
# Edited Events for 2005 Apr 27
#
#Event    Begin   Max    End   Obs  Q  Type  Loc/Frq   Particulars        Reg#
#------------------------------------------------------------------------------

5170       0407   ////   0409  LEA  C  RSP   065-136   III/1
5180 +     0452   ////   0452  SVI  C  RSP   032-075   III/1
5190       0641   ////   0648  SVI  C  RSP   029-076   III/1
5200       1004   1008   1012  G10  5  XRA   1-8A      B1.4     6.3E-05
5210 +     1235   ////   1235  SVI  C  RSP   025-050   III/1
5220 +     1418   ////   1423  SVI  C  RSP   025-081   III/1
5250       1531   ////   1532  SVI  C  RSP   025-075   III/1
5260       1554   ////   1554  SVI  U  RSP   025-061   III/1
5270 +     1914   1922   1934  G12  5  XRA   1-8A      B2.9     3.0E-04   0756
5270       1926   1930   1932  G12  5  XFL   S06E52    1.0E+02  1.7E+02   0756
5280 +     2002   2005   2008  G12  5  XRA   1-8A      B2.0     6.3E-05   0756
5290 +     2043   2055   2107  G12  5  XRA   1-8A      B4.3     4.5E-04

Figure 1.2: Part of a text file example of solar activity events [2]

Regarding data presentation, data is arranged in multiple ways, from a complete tabular format (Figure 1.1), to a sparse tabular format (Figure 1.2) or a specific non-tabular format (Figure 1.3).

:Product: 0427GEOA.txt
:Issued: 2005 Apr 27 0335 UTC
# Prepared by the U.S. Dept. of Commerce, NOAA,
# Space Environment Center.
# Geoalert
WWA117
UGEOA 20401 50427 0330/ 9935/ 11271 20271 30271 99999
UGEOE 20401 50427 0330/ 26/00 99999
UGEOI 20401 50427 0330/ 26/// 10020 20910 30030 40000 50000 61207 71404 80001 90550 99999
UGEOR 20401 50427 0330/ 26/24 27101 10756 20000 30400 44535 50550 60010 25506 16200 99999
PLAIN

Figure 1.3: A text file example of flare, magnetic and proton forecasts [2]
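To make the notion of prefix-delimited areas concrete, the sketch below (illustrative only, and not part of the thesis' Data Processing Module; the class and method names are hypothetical) splits a NOAA-style event file such as the one in Figure 1.2 into its metadata, comment and data areas, using the ":" and "#" prefixes as delimiters.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: splits a NOAA-style event file into its three areas. */
public class PrefixSectioner {

    public static class Sections {
        public final List<String> metadata = new ArrayList<>();  // lines starting with ':'
        public final List<String> comments = new ArrayList<>();  // lines starting with '#'
        public final List<String> data = new ArrayList<>();      // remaining (data) lines
    }

    public static Sections split(Path file) throws IOException {
        Sections s = new Sections();
        for (String line : Files.readAllLines(file)) {
            if (line.startsWith(":")) {
                s.metadata.add(line);   // e.g. ":Product: 20050427events.txt"
            } else if (line.startsWith("#")) {
                s.comments.add(line);   // e.g. "# Missing data: ////"
            } else if (!line.isBlank()) {
                s.data.add(line);       // e.g. "5170  0407  ////  0409  LEA ..."
            }
        }
        return s;
    }

    public static void main(String[] args) throws IOException {
        Sections s = split(Path.of(args[0]));
        System.out.printf("metadata=%d, comments=%d, data=%d%n",
                s.metadata.size(), s.comments.size(), s.data.size());
    }
}

A file such as the one in Figure 1.1 carries no such prefixes, which is precisely why per-file declarative definitions, rather than hard-coded rules of this kind, are pursued in this thesis.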


Also, the file structure and data presentation may evolve dynamically, i.e. new parameters may be added, deleted or updated in a file, thus making the format vary in time. Notification about format changes is nonexistent, and changes have to be inferred by the users.
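Since no notification is issued when a provider changes a file format, one pragmatic way to infer a change is to fingerprint the descriptive header area and compare it between downloads. The helper below is a hedged sketch of that idea (the class is hypothetical and not part of the thesis' solution); it only detects changes that are visible in the ":" / "#" header lines.

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.CRC32;

/** Hypothetical helper: fingerprints the header area so a format change can be noticed. */
public final class HeaderFingerprint {

    /** CRC-32 over the metadata and comment lines only; data lines are ignored. */
    public static long of(List<String> lines) {
        String header = lines.stream()
                .filter(l -> l.startsWith(":") || l.startsWith("#"))
                .collect(Collectors.joining("\n"));
        CRC32 crc = new CRC32();
        crc.update(header.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}

Comparing the fingerprint of a newly downloaded file against the previous one flags a header change for human review; silent changes confined to the data area would still have to be caught by validation rules such as the ones discussed later in this thesis.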

1.2 ETL Applied to Semi-Structured Text Data

Most users that intend to use data present in semi-structured text files do not have computer-science expertise. Currently these individuals depend on computer-science experts, since most ETL tools require at some point the development of source code or computer-science expertise (e.g. database schemas, XML schemas, Structured Query Language - SQL). Further, the existing ETL tools consider semi-structured files a secondary data source (sometimes even an optional one), since the main focus of ETL is still structured data (e.g. database, eXtensible Markup Language - XML, message services and API oriented), usually specific to the data warehousing / integration domains. Due to the complexity of ETL tools (and their high prices), users without computer-science support are forced to use automatic processing mechanisms only when dealing with files with a simple structure. These files usually follow a well-defined tabular format, and Microsoft (MS) Excel [4] is a common selection for data processing (even though it offers a limited set of data processing functionalities) due to its familiarity to the user. In order to use such tools, users may have to normalize the file structure through direct manipulation of the file contents (e.g. using any ordinary text editor), a time-consuming task that is prone to human error.

1.3 SEIS and SESS Systems

Space environment data – commonly known as Space Weather within the space / physical domain communities – is a good example of information exchange using semi-structured text files. The space domain term Space Weather (S/W) [5, 6] can be defined as the combination of conditions on the sun, solar wind, magnetosphere, ionosphere and thermosphere. Space Weather affects not only Earth's environment, but especially all Spacecraft (S/C) systems orbiting the planet. Degradation of solar panels and the occurrence of Single Event Upsets (SEU) - unpredicted bit changes in the S/C onboard memories due to cosmic radiation - are two examples of S/W effects. The integration of both near real-time and historical S/W and S/C data for analysis is fundamental in the decision-making process during critical spacecraft control periods and for extending the mission's lifetime to its maximum. Analysis of the current solar activity together with the internal S/C sensor measurements may force or prevent the execution of manoeuvres in order to protect S/C equipment or even human lives. Space Weather information is available on many public HTTP / FTP internet sites (e.g. [2]) as semi-structured text files. Spacecraft telemetry data (usually proprietary and not publicly available) is also commonly distributed in text files following a semi-structured format.

The Space Environment Support System for Telecom and Navigation Missions (SESS) [7] is a multi-mission decision support system, capable of providing near real-time monitoring and visualization [8], in addition to offline historical analysis [9], of S/W and S/C data, events and alarms to Flight Control Teams (FCT). The main goal of the system is to provide S/C and S/W data integration to Flight Control Teams; it is explained in detail in Chapter 6. This system builds on the experience of the Space Environment Information System for Mission Control Purposes (SEIS) [9], a single-mission decision support system prototype that also enables S/W and S/C data integration. The author has participated in the development of both systems. In SEIS the author was responsible for the partial definition of metadata ETL scripts for processing input data files (relevant in the system's scope), while in SESS the author was responsible for the complete development of the declarative ETL solution.

1.4 Thesis Overview

This section introduces the main goals of the thesis Extraction and Transformation of Data from Semi-Structured Text Files Using a Declarative Approach and the author's expected contributions to the ETL computer-science community. The thesis proposes a new approach to ETL, enabling a clearer separation of concerns by dividing ETL into domain tasks (ETD - Extraction, Transformation and Data Delivery) and technical tasks (IL - Integration and Loading).
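As a rough illustration of the intended separation of concerns (the concrete interfaces and formats are defined in Chapters 4 and 5; the Java names below are hypothetical), the domain-facing ETD side produces target-model-agnostic records from a raw text file and a declarative File Format Definition, while the computer-science IL side maps those records onto the target system:

import java.util.List;
import java.util.Map;

/** Hypothetical interfaces sketching the ETD + IL split (names are illustrative). */
public interface EtdEngine {
    /**
     * Domain side: applies a declarative File Format Definition (FFD) to a raw
     * text file and returns target-model-agnostic records keyed by identifier.
     */
    List<Map<String, String>> apply(String ffdXml, String rawTextFile);
}

interface IntegrationLoader {
    /**
     * Computer-science side: maps the delivered records onto the target system
     * model (tables, schemas, message queues) and loads them.
     */
    void load(List<Map<String, String>> deliveredRecords);
}

The important point is the boundary: nothing on the ETD side of this boundary requires knowledge of the target system models, and nothing on the IL side requires knowledge of the source file formats.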

1.4.1 Goals

The ETL solution envisaged for the SESS system followed a primary guideline: the implementation of specific source code, or computer-science expertise in general, would not be required from domain experts. Instead, a declarative approach was suggested, based on different types of metadata that the domain user instantiates using specific visualization tools. In this manner, the data processing pipeline can be directly used and maintained by domain experts without computer-science expertise. The analysis, design and implementation of this thesis have occurred and been applied in the scope of the SESS system.

Following this new approach to ETL (described in detail in Chapter 3), a set of thirteen high-level requirements has been derived for attaining a ready-to-use data processing solution for domain experts:

o Open Source: The solution shall be implemented using open-source technologies, presented as a no-acquisition-cost solution, accessible to anyone. Furthermore, the solution shall be developed using software independent from the operating system;

o Completeness: A full data processing solution shall be available, comprising data retrieval, data processing and overall management of the data processing solution;

o Separation of Concerns: The domain user shall be able to use and maintain the data processing pipeline without requiring computer-science expertise. All domain procedures and definitions shall be represented using a high-level declarative language. No specific source code shall be required to implement the processing of a single text file;

o User Friendliness: A graphical application shall be available, making use of the declarative language in a way that is transparent to the end user;

o Performance: Data retrieval and data processing shall have a reduced response time while preserving both CPU and network bandwidth resources;

o Scalability: Both data retrieval and data processing must be capable of handling multiple simultaneous downloads and processing requests, respectively;

o Modularity: The solution architecture and implementation shall be as modular as possible, clearly separating domain from technical tasks. Further, there shall be a clear separation between logic and presentation layers, easing future maintenance tasks;

o Reusability: System modules shall be designed and implemented focusing on reuse as much as possible. Such an approach shall be applied for factoring common behaviour / functionalities within the data processing solution itself or for reusing system components, entirely or partially, in the solution of other problems;

o Metadata Driven: The data processing solution shall be metadata driven, which means that all processes for executing and managing the data retrieval and ETD pipeline are based on metadata;

o Correctness: Data typing facilities and validation rules shall be available during the entire ETD process, in order for the outcome to be valid. These data quality mechanisms shall be applied iteratively in the Extraction, Transformation and Data Delivery steps;

o Data Traceability: It shall be possible to trace a processed datum value back to the originally downloaded file (a minimal illustration is sketched after this list);

o Fault Tolerance: In case of failure during download, the retrieval of the failed file shall be retried. If an error occurs during data processing, the administrator must be notified and other data processing operations shall be resumed;

o Validation: After performing the ETD specifications based on a primary input file, the FFD generality shall be tested with a larger set of text files belonging to the same class of files.

1.4.2 Contributions With this thesis, the author's expected contributions to the ETL community fall into three categories:

o Rethink ETL: This thesis presents a new paradigm for ETL that separates ETD domain-expertise tasks from computer-science IL tasks, such that ETL = ETD + IL. This approach envisages the creation of a declarative language for representing the ETD statements to be applied to input text files, and a graphical tool that enables easy manipulation of the declarative language, making it transparent to a domain user (without computer-science expertise). This new approach does not intend to trigger a revolution in the current ETL paradigm. Instead, it provides a localized contribution, expected to ease the processing of semi-structured data available in text files, using a specialized tool suite that can be effectively handled by common non-expert users;

o Propose a complete architecture for ETD+IL: The proposed architecture shall enable a complete data processing solution based on ETD. Besides the ETD supporting applications, such a solution shall comprise an engine for file download (also based on declarative assertions) and management tools that enable the control, visualization and data traceability of the processing pipeline execution;

o Implementation and validation of the proposed architecture and tools: Finally, the declarative data processing module shall be implemented, integrated, tested and validated in the scope of a real operational system (not a prototype application). Special attention shall be paid to the implementation of the graphical application that interacts with the domain user for defining ETD metadata scripts without recourse to computer-science expertise. Furthermore, a state-of-the-art survey shall be conducted, evaluating the current ETL trends within the research, open-source and commercial domains for the most relevant applications.

1.5 Thesis Structure The contents of this thesis are structured in eight chapters:

Chapter One: This chapter introduces the ETL problematic for dealing with semi-structured text files and provides a first motivation for a novel approach to ETL that separates domain expertise from computer-science expertise. The SEIS and SESS space domain systems are presented, where the author carried out his first activities in the ETL domain and where the novel approach to ETL has been developed and validated. The thesis is briefly described, focusing on its goals / requirements and the author's expected contributions to the ETL community. Finally, the report's structure is presented, as well as the conventions used.

Chapter Two: This chapter focuses on the state of the art for the ETL domain. First, the current trends on ETL conceptual representation and frameworks are presented, followed by a historical presentation of data integration architectures. Next, the most common approaches to data processing (i.e. consolidation, federation, propagation) are described, as well as hybrid approaches. An explanation of data integration technologies follows, with their advantages and disadvantages. The usage of metadata for describing ETL statements is explained, together with an evaluation of the proposed / existing standards. Due to the relevance of ETL tools, external surveys that evaluate them are referenced. For the most relevant tools in the research, open-source and commercial domains, a report has been conducted by the author. The SESS system is particularly highlighted due to the author's participation. Finally, some conclusions and remarks are provided, summarizing the current state of the art for the ETL domain.

Chapter Three: Focusing on a novel approach to ETL, this chapter proposes a clear separation of domain from technological concerns, such that ETL = ETD + IL. First, the classical ETL approach is described, analysed and evaluated in the scope of semi-structured scientific data. Then the ETD+IL approach is explained, describing specifically which actions belong to ETD and which to IL. Finally, a set of requirements is derived for accomplishing a complete data retrieval and processing solution.

Chapter Four: This chapter presents a complete Data Processing solution based on the proposed ETD+IL approach. First, the main technologies involved in the construction of the data processing solution are introduced, as well as how they have been weaved together. A discussion regarding the solution's architectural design follows. Then, each individual component of the Data Processing solution is described. Depending on the component's complexity, its internal data flows and functionalities are explained, as well as the core services made available to external applications (if any).

Chapter Five: The fifth chapter is dedicated to the File Format Definition (FFD) language and the File Format Definition Editor (FFD Editor) graphical application. First, an abstract model for the FFD language is presented, followed by a description of how the language has been implemented using XML-based technologies. Next, the FFD Editor application is introduced, starting with a general overview of the application's graphical organization, followed by an explanation of how the three ETD steps are instantiated seamlessly for the domain user. Due to its complexity (derived from the data normalization process), the graphical operations related to the Extract activity are explored in greater detail. Finally, some considerations are presented regarding the FFD language expressiveness and extensibility.

Chapter Six: This chapter presents how the ETD+IL thesis has been put into practice in a set of case studies. The presentation follows two perspectives: first, the generality and versatility of the solution is explored for dealing with data from different domains. Second, it is explained how the Data Processing Module has been integrated, tested and validated in an operational space domain system: SESS. Special attention is placed on this second perspective, starting with an overview of the SESS system objectives and the Galileo reference mission. Then, the overall SESS architecture is described, including a summarized explanation of all the components that have not been developed in the context of this thesis. The final section describes how the ETD+IL approach has been successfully applied to SESS and provides an evaluation of its usage.

Chapter Seven: This chapter summarizes the work described in this report, presenting an overview and evaluation of the ETD+IL conceptual approach. Overall conclusions are presented and future work in the field is proposed, pointing to an evolution of the solution herein presented.

Chapter Eight: This chapter comprises all the bibliographic contents referenced throughout the report.

1.6 Conventions This report follows a set of conventions regarding text formatting styles, diagrams and concept terminology, which are used as standards throughout the entire document. These conventions are presented individually in the following sub-sections in order to provide a clearer understanding of the report contents.



1.6.1 Textual Notations The thesis document is divided into chapters, each presenting an individual issue (e.g. domain description, problem description or technical solution) that can be analysed in isolation. Each chapter starts with a summary page where its main contents are described. Within each chapter, a set of headings, following a hierarchical numeric notation, structures the chapter contents. Each heading comprises text that may be formatted in one of three ways:

o Regular text: A set of textual statements without particular relevance over the others;

o Bold text: Highlights a particular text statement, focusing the attention on it;

o Italics text: Refers to a specific domain statement (usually technical) or to a colloquial expression.

Footnote text is also introduced whenever a complementary explanation is required, but without diverting the attention from the main text. Acronyms are widely used throughout this report due to the massive presence of scientific terms. All acronym expressions are summarized in Table 1.1 and are introduced as required in the text. The first time an acronym is referenced, a full explanation is provided, while in the remaining references only the acronym is used.

1.6.2 Unified Modelling Language The Unified Modelling Language (UML) [10, 11] is a non-proprietary specification language for object modelling in the field of software engineering. UML is a general-purpose modelling language that includes a standardized graphical notation used to create an abstract model of a system, referred to as a UML model. UML is extensible, offering the stereotype mechanism for customization. Since UML is widely accepted as a de facto standard by the computer-science community for diagram representation, whenever possible, the diagrams presented in the scope of this thesis (mainly State Diagrams 2) shall be UML compliant.

2 State Diagrams represent a finite state machine as a directed graph, where each node can be mapped to a high-level computation state / operation and the connections between nodes represent changes between states.

Chapter 2 Related Work This chapter focuses on the state of the art for the ETL domain. First, the current trends on ETL conceptual representation and frameworks are presented, followed by a historical presentation of data integration architectures. Next, the most common approaches to data processing (i.e. consolidation, federation, propagation) are described, as well as hybrid approaches. An explanation of data integration technologies follows, with their advantages and disadvantages. The usage of metadata for describing ETL statements is explained, together with an evaluation of the proposed / existing standards. Due to the relevance of ETL tools, external surveys that provide an evaluation of them are referenced. For the most relevant tools in the research, open-source and commercial domains, a report has been conducted by the author. The SESS system is particularly highlighted due to the author's participation. Finally, some conclusions and remarks are provided, summarizing the current state of the art for the ETL domain.


The development of a new information system poses many challenges and doubts to the team responsible for its implementation. A significant part is related to the ETL component, responsible for the acquisition and normalization of data. During the requirement / design phases, five questions commonly drive the implementation of an ETL component:

1. Can an existing Commercial Off-The-Shelf (COTS) solution be reused?

2. Must a custom solution be developed specifically for this problem?

3. What is the associated cost (time and man-power)?

4. How robust shall the application be?

5. Is the application easy to maintain and extend?

In order to answer the first question, a survey is usually conducted regarding the state of the art for the ETL domain. Depending on the budget associated with the information system, a higher evaluation effort may be placed on research / open-source applications or on commercial applications. An alternative to the COTS approach consists in developing a custom (usually very specific) ETL solution. This last approach is frequent when the ETL component must follow a strict set of requirements that are found to be too specific. The decision between using a COTS approach and developing an ETL component from scratch is also influenced by four main parameters: the associated cost (third question), the required robustness level (fourth question), and maintainability and extensibility issues (fifth question).

2.1 The Correct ETL Tool Unfortunately, practice has shown that the choice of the correct ETL tool is underestimated, minimized and sometimes even ignored. This happens frequently because the choice becomes a management rather than a technological issue, where research and open-source tools are rejected due to their perceived lack of credibility and a commercial tool is selected, often with the associated cost as the only evaluation criterion. According to the Gartner ETL evaluation report [12], in-house development procedures and poor management decisions when selecting an appropriate ETL tool consume up to 70% of the resources of an information system project (e.g. a data warehouse). A real-world example of ETL tool selection, taken from [13], is described next. It shows how the adoption of an appropriate ETL component contributed greatly to cost savings in the order of $2.5 billion in a single year:

There are few companies that have been more aggressive than Motorola in pursuing e-business. Motorola made it public last year that one of its corporate-wide goals - a strategic rather than a tactical one - was to get all spending into electronic systems. But, in order to make this kind of progress, the company has had to lean heavily on a business intelligence initiative. Chet Phillips, IT director for BI at Motorola, was responsible for this initiative. "At the beginning of 2002, the procurement leaders at Motorola were given a goal to drop $2.5 billion worth of spend out of the cost structure on the direct and indirect side, and they needed a way of looking at spend comprehensively," Phillips says. Gathering the spend in one location would provide the visibility and decision support that the procurement leaders needed; in the way of such aggregation, however, was Motorola's reliance on many different enterprise systems: three version levels of Oracle, SAP (particularly within the semiconductor organization), and Ariba on the indirect procurement side. Motorola already had an enterprise application integration tool from vendor webMethods that touched a lot of different systems, but Phillips explains how, by nature, it couldn't fit the need at hand. "EAI communicates between the different systems -- it's transaction-level data interaction," Phillips says. To go deeper in getting the data out, Motorola got an ETL tool from BI vendor Informatica. Phillips describes the benefits of the tool: "By using its capability to pull data as opposed to requesting that source systems push data, we covered ground quickly without using intensive IT resources and we had minimal intrusion on source systems." Motorola's BI project handed the baton off to the procurement organization, which could now examine $48 billion worth of spending, one million purchase orders, and six million receipts at the desired level of detail. For its part, the procurement organization has come through for the corporation. Motorola reaped $2.5 billion in cost savings last year thanks to its new e-procurement tools and processes, and expects to save more this year.

2.2 ETL Conceptual Representation and Framework Work in the area of ETL conceptual representation and methodology standardization has been limited to a few initiatives that practice has shown to be too academic, vague and / or complex. Thus, despite some efforts, no ETL conceptual representation or methodology is commonly agreed upon among the research, open-source and commercial ETL communities. The next three sub-sections present some of these standards, which have been partially adopted.

2.2.1 AJAX Significant work has been developed in the area of conceptual representation of ETL processes [14-16] and of ETL methodology [17-19] by computer-science researchers from the University of Ioannina. Both works were envisaged to ease the documentation and formalization effort for ETL at the early stages of data warehousing definition (without describing technical details regarding the actual implementation of ETL tasks). A set of graphic symbols has been suggested for the conceptual representation of ETL primitives such as concepts, instances, transformations, relations and data flows. The same researchers proposed a general methodology for dealing with ETL processes [20] (following the proposed ETL conceptual representation), based on a two-layered design that attempts to separate the logical and physical levels.


Using this framework, any ETL program involves two activities (Figure 2.1):

1. The design of a graph of data transformations that should be applied to the input data - the logical level;

2. The design of performance heuristics that could improve the execution speed of the data transformations without sacrificing accuracy - the physical level.

Figure 2.1: A two-level framework (example for a library)

Both the conceptual representation and the methodology have been put into practice with the AJAX prototype [21] (analysed in the ETL – State of the Art report [22]). At the logical level, the main constituent of an AJAX ETL program is the specification of a data flow graph whose nodes are operations of the following types: mapping, view, matching, clustering and merging, while the input and output data flows of the operators are logically modelled as database relations. The design of the logical operators was based on the semantics of SQL primitives, extended to support a larger range of transformations. Each operator can make use of externally defined functions or algorithms, written in a third-generation programming language (3GL) and then registered within the tool's library of functions and algorithms. At the physical level, decisions can be made to speed up the execution. First, the implementation of the externally defined functions can be optimized. Second, an efficient algorithm can be selected, among a set of alternatives, to implement a logical operation. The standards proposed by AJAX for conceptual representation and framework had no impact on the remaining research / commercial tools and have not been adopted in practice. All further research using the conceptual representation and framework proposed by AJAX has been conducted by the same research team, which has meanwhile enhanced the tool, although without any visible impact in the ETL domain.



2.2.2 Meta Object Facility A different approach for representing ETL concepts, used by most metadata-based ETL tools (at least to some extent), is provided by the abstract Meta Object Facility (MOF) [23]. MOF is a modelling standard from the Object Management Group (OMG) that defines a four-layer architecture in which an item belonging to a layer L is an instance of a concept item described in the layer above (i.e. L + 1). Figure 2.2 draws a parallel between UML diagrams and Java using the MOF hierarchy as an example.

Figure 2.2: MOF layers using UML and Java as comparison

A parallelism between metadata and the MOF standard can easily be drawn through the definition of the instance and concept terms. Concepts refer to the definition of entity types (e.g. car, person or table), while instances are actual specifications of those entities (e.g. Audi A4, John or a round wood table). Thus, the Information Layer (M0) represents actual data (e.g. a record in a database), and the Model Layer (M1) contains the instance metadata describing the Information Layer objects. The Metamodel Layer (M2) contains the definitions of the several types of concept metadata that shall be stored in the previous layer. The Meta-Metamodel Layer (M3) contains common rule definitions for all concepts (e.g. structural rules regarding concepts), enabling a normalized metadata representation. MOF was used in the definition of UML as well as of other OMG standards, such as the Common Warehouse Metamodel (presented next). As described, the MOF layer representation is quite abstract, and each research group / tool vendor instantiates its own objects following internal representation schemes and functions, providing no real standardization. Furthermore, no methodology, operation set or metadata representation has been defined or proposed for ETL as a specialization of MOF.
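To make the layering more tangible (and to echo the UML / Java parallelism of Figure 2.2), the following sketch uses plain Java reflection to expose three of the four layers; this is only an informal illustration of the idea, not part of the MOF specification itself.

```java
// Informal illustration of MOF-style layering using Java reflection.
public class MofLayersDemo {

    // M1: a model element (a user-defined class describing "customer" data).
    static class Customer {
        String name;
        Customer(String name) { this.name = name; }
    }

    public static void main(String[] args) {
        // M0: actual data - a concrete instance (comparable to a database record).
        Customer m0 = new Customer("John");

        // M1: the model describing that instance.
        Class<Customer> m1 = Customer.class;

        // M2: the metamodel concept "class" that every M1 element instantiates.
        Class<?> m2 = m1.getClass();   // java.lang.Class

        System.out.println("M0 instance : " + m0.name);
        System.out.println("M1 model    : " + m1.getName());
        System.out.println("M2 metamodel: " + m2.getName());
        // M3 (the meta-metamodel, MOF itself) has no direct Java counterpart here.
    }
}
```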


2.2.3 Common Warehouse Metamodel The Common Warehouse Metamodel (CWM) [24] standard enables interoperability between tools belonging to the data warehousing and business intelligence domains, through a shared, pre-established metamodel. Data interchange is supported through XML Metadata Interchange (XMI), which acts as an independent middleware for metadata representation. Currently, XMI usage is mostly limited to import / export operations in Computer-Aided Software Engineering (CASE) tools. Through this standard, OMG expected a wide adoption among data warehousing vendors and supporting tools (e.g. from the ETL domain), enabling metadata to be shared across data warehousing systems more efficiently and significantly reducing maintenance costs. The CWM metadata interchange capabilities follow three main requirements:

o Common Metamodels: Common data warehouse subject areas must have the same metamodel;

o Common Definition Language: Models must be defined in the same language;

o Common Interchange Protocol: XMI is the language to be used for interchange purposes.

Figure 2.3 presents the main layers and associated technologies of CWM:

o Object Model: The modelling foundation on which CWM is built. Components in the layers above may inherit / reference UML definitions;

o Foundation: General services that are shared by other packages;

o Resource: Data models for operational data sources and data warehouses;

o Analysis: Logical services that can be mapped onto data stores defined in the Resource layer;

o Management: Maintenance and management services in data warehousing.

Figure 2.3: Common Warehouse Metamodel structure [24]

Although CWM has been partially adopted by some software vendors, such as Oracle, IBM, Hyperion and Unisys, its adoption was not widespread. In fact, some of the companies involved in the creation of the standard, such as Microsoft, do not support it. In the author's opinion, companies not supporting CWM sustain their decision on the standard's complexity: by trying to represent all data-warehousing and business intelligence features, it became too general and lost its practical focus along the way.

2.3 Classical Data Integration Architectures Historically, four main approaches have been followed for solving the data integration problem [25-27]: hand coding, code generators, database embedded ETL and metadata driven ETL engines.

2.3.1 Hand Coding Since the dawn of data processing, integration issues have been solved through the development of custom hand-coded programs [25-27], developed in-house or by consulting programming teams. Although this approach appears at first as a low-cost solution, it quickly evolves into a costly, hard to maintain and time consuming task. This usually happens because all the relevant knowledge is represented in low-level source code, making it hard to understand, maintain and update, and especially prone to error during maintenance tasks. The problem is even greater for legacy systems where the original team that developed the code is no longer available. In the mid-1990s, this paradigm started to be replaced by a number of third-party products (code generators and engine-based tools) from vendors specialized in data integration and ETL. Surprisingly, even though ETL tools have been developed for over 10 years and are now mature products, hand coding still persists as a significant contribution to solving transformation problems. These efforts still proliferate in many legacy environments, low-budget data migration projects, or when dealing with very specific ETL scenarios. Although hand-coded ETL provides unlimited flexibility, it has an associated cost: the creation of an ETL component from scratch that may be hard to maintain and evolve in the near future, depending on its complexity.

2.3.2 Code Generators Code generators [25-27] were the first attempt to increase data processing efficiency, replacing possibly inefficient source code developed manually. Code generation frameworks have been proposed that present a graphical front-end where users can map processes and data flows and then automatically generate source code (such as C or Common Business Oriented Language - COBOL) as the resulting run-time solution, which can be compiled and executed on various platforms. Generally, ETL code-generating tools can handle more complex processing than their engine-based counterparts. Compiled code is generally accepted as the fastest of solutions and also enables organizations to distribute processing across multiple platforms to optimize performance. Although code generators usually offer visual development environments, they are sometimes not as easy to use as engine-based tools, and can lengthen overall development times in direct comparison with engine-based tools. Code generators were a step up from hand coding for developers, but this approach did not gain widespread adoption: as solution requirements and Information Technology (IT) architecture complexity grew, the issues around code maintenance and inaccuracies in the generation process led to higher rather than lower costs.

2.3.3 Database Embedded ETL Building on the early code generators, Database Management System (DBMS) vendors have embedded ETL capabilities in their products, using the database as the engine and SQL as the supporting language [25-27]. Some DBMS vendors have opted to include third-party ETL tools that leverage common database functionality, such as stored procedures and enhanced SQL, increasing the transformation and aggregation power. This enabled third-party ETL tools to optimize performance by exploiting the parallel processing and scalability features of the DBMS. Other DBMS vendors offer ETL functions that mirror features available from ETL specialist vendors. Many database vendors offer graphical development tools that exploit the ETL capabilities of their database products, competing directly with third-party ETL solution providers. Database-centric ETL solutions vary considerably in quality and functionality. To some extent, these products have exposed the limited capability of SQL and database-specific extensions (e.g. PL/SQL, stored procedures) to handle cross-platform data issues, XML data, data quality, profiling and the business logic needed for enterprise data integration. Furthermore, most organisations do not wish to be dependent on a single proprietary vendor's engine. However, for some specific scenarios, the horsepower of the relational database can be used effectively for data integration, with better results than metadata driven ETL engines.
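A minimal sketch of the database-embedded style is given below, assuming hypothetical staging (STG_ORDERS) and target (DW_DAILY_SALES) tables in the same database: the transformation and load are expressed as a single SQL statement that runs entirely inside the DBMS engine, with the client merely submitting it.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of database-embedded ETL: the DBMS does the transformation work.
// Connection details and table / column names are placeholders.
public class DbEmbeddedEtl {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/dw", "etl_user", "secret");
             Statement st = con.createStatement()) {

            // Transformation and load pushed down to the database as one SQL
            // statement: cast the raw date and aggregate amounts per day.
            st.executeUpdate(
                "INSERT INTO DW_DAILY_SALES (sale_day, total_amount) " +
                "SELECT CAST(order_date AS DATE), SUM(amount) " +
                "FROM   STG_ORDERS " +
                "GROUP  BY CAST(order_date AS DATE)");
        }
    }
}
```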

2.3.4 Metadata Driven ETL Engines Informatica [28] pioneered a new data integration approach by presenting a data server, or engine, powered by open, interpreted metadata as the main driver for transformation processing [25-27]. This approach addresses complexity and meets performance needs, while also enabling reuse and openness since it is metadata driven. Other ETL tool vendors have since adopted this approach, through other types of engines and languages, making it the current trend in ETL data processing. Many of these engine-based tools have integrated metadata repositories that can synchronize metadata from source systems, target databases and other business intelligence tools. Most of these tools automatically generate metadata at every step of the process and enforce a consistent metadata-driven methodology that developers must follow. Proprietary scripting languages are used for representing metadata, running within a generally rigid, centralised ETL server. These engines use language interpreters to process ETL workflows at run-time; the workflows are defined by developers in a graphical environment and stored in a metadata repository, which the engine reads at run-time to determine how to process incoming data. In this way it is possible to abstract some of the implementation issues, make data mapping graphically oriented and introduce automated ETL processes. Key advantages of this approach are:

o Domain experts without programming expertise can use ETL tools;

o ETL tools have pre-built connectors for most source and target systems;

o ETL tools deliver good performance even for very large data sets.

Although this approach is based on metadata interpretation, the need for custom code is rarely eliminated. Metadata driven engines can be augmented with selected processing modules hand coded in an underlying programming language. For example, a custom CRC (Cyclic Redundancy Check) algorithm could be developed and introduced into an ETL tool if the function was not part of the core function package provided by the vendor. Another significant characteristic of an engine-based approach is that all processing takes place in the engine, not on the source systems. The engine typically runs on a server machine and establishes direct connections to source systems. This architecture may raise some issues, since some systems may require a great deal of flexibility in their architecture for deploying transformations and other data integration components, rather than relying on a centralized server.
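As an illustration of this extension mechanism, the sketch below registers a hand-coded checksum function with a toy function registry; the EtlEngine class and the "crc32" registration are invented for the example and do not correspond to any particular vendor API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.zip.CRC32;

// Sketch of how a metadata-driven engine can be augmented with hand-coded
// functions. EtlEngine and the "crc32" registration are hypothetical.
public class CustomFunctionPlugin {

    /** Minimal stand-in for a metadata-driven engine's function registry. */
    static class EtlEngine {
        private final Map<String, Function<String, String>> functions = new HashMap<>();

        void register(String name, Function<String, String> fn) {
            functions.put(name, fn);
        }

        /** The engine resolves the function named in the metadata at run-time. */
        String apply(String functionNameFromMetadata, String value) {
            return functions.get(functionNameFromMetadata).apply(value);
        }
    }

    public static void main(String[] args) {
        EtlEngine engine = new EtlEngine();

        // Custom CRC function, hand coded because it is not in the core package.
        engine.register("crc32", value -> {
            CRC32 crc = new CRC32();
            crc.update(value.getBytes());
            return Long.toHexString(crc.getValue());
        });

        // A metadata script would refer to the function only by name.
        System.out.println(engine.apply("crc32", "2007-06-01;GALILEO;OK"));
    }
}
```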

2.4 Approaches to Data Processing Data integration is usually accomplished using one (or a composition) of the following techniques [29]: consolidation, federation and propagation, as depicted in Figure 2.4.

Figure 2.4: Data consolidation, federation and propagation

2.4.1 Data Consolidation Data Consolidation [29] gathers data from different input sources and integrates it into a single persistent data store. Centralized data can then be used either for reporting and analysis (data warehouse approach) or as a data source for external applications.

When using data consolidation, a delay or latency period is usually present between the data entry at the source system and the data being available at the target store. Depending on business needs, this latency may range from a few minutes to several days. The term near real time is used to describe a data exchange operation with minimum latency (usually in the minutes range). Data with zero latency is known as real-time data and is almost impossible to reach using data consolidation.

Whenever the exchanged data refers to high-latency periods (e.g. more than one day), a batch approach is applied, where data is pulled from the source systems at scheduled intervals. This pull approach commonly uses queries that take periodical snapshots of the source data. Such queries are able to retrieve the current version of the data, but unable to capture any intermediate changes that might have occurred since the last snapshot: a source value could have been updated several times during this period and these intermediate values would not be visible at the target data store. In order to detect every value change at the data source, the source system must implement some kind of logging facility (e.g. supported by a file or database table), keeping track of every operation that might affect the data values. Under this paradigm, a batch with the data operations is transferred and applied at the data targets, following the same order in which the operations were executed at the data source.

On the other hand, when the exchanged data refers to low-latency periods (e.g. seconds), the target data store must be updated by online data integration applications that continuously capture and push the data changes occurring at the source systems to the target store. This push technique requires data changes to be captured using some form of Change Data Capture (CDC) technique. Both pull and push consolidation modes can be used together: e.g. an online push application may accumulate data changes in a staging area, which is then queried at scheduled intervals by a batch pull application. While the push model follows an event-driven approach, the pull mode gathers data on demand (Figure 2.5).
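A minimal sketch of the pull mode, assuming a hypothetical SRC_MEASUREMENTS table reachable over JDBC: a scheduled job issues a periodic snapshot query and copies the current values into the consolidated store (intermediate changes between two runs are, as noted, not captured).

```java
import java.sql.*;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Pull-mode consolidation sketch: periodic snapshot of a source table.
// Connection strings and table names are hypothetical.
public class PullConsolidation {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(PullConsolidation::snapshot, 0, 15, TimeUnit.MINUTES);
    }

    static void snapshot() {
        try (Connection src = DriverManager.getConnection("jdbc:postgresql://src-host/ops");
             Connection dst = DriverManager.getConnection("jdbc:postgresql://dw-host/dw");
             Statement q = src.createStatement();
             ResultSet rs = q.executeQuery(
                     "SELECT sensor_id, value, read_time FROM SRC_MEASUREMENTS");
             PreparedStatement ins = dst.prepareStatement(
                     "INSERT INTO CONSOLIDATED_MEASUREMENTS (sensor_id, value, read_time) "
                     + "VALUES (?, ?, ?)")) {

            while (rs.next()) {                  // copy the current snapshot row by row
                ins.setString(1, rs.getString("sensor_id"));
                ins.setDouble(2, rs.getDouble("value"));
                ins.setTimestamp(3, rs.getTimestamp("read_time"));
                ins.addBatch();
            }
            ins.executeBatch();                  // single batched load at the target
        } catch (SQLException e) {
            e.printStackTrace();                 // a real system would log / alert here
        }
    }
}
```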

Figure 2.5: Push and pull modes of data consolidation

Applications commonly use consolidated data for querying, reporting and analysis purposes. Updating consolidated data is usually not allowed due to data synchronization problems with the source systems. However, a few data integration products enable this writing capability, providing ways to handle possible data conflicts between the updated data in the consolidated data store and the original source systems. Data consolidation allows large volumes of data to be transformed (restructured, reconciled, cleansed and aggregated) as they flow from the source systems to the target data store. As disadvantages, this approach requires intensive computing power to support the consolidation process, network bandwidth for transferring the data, and disk space for the target data store. Data consolidation is the main approach used by data warehousing applications to build and maintain an operational data store and an enterprise data warehouse, and ETL is one of the most common technologies used to support data consolidation. Besides ETL, another way of accomplishing data consolidation is using Enterprise Content Management (ECM) technology. Most ECM solutions focus on the consolidation and management of unstructured data such as documents, reports and Web pages 3.

3 ECM will not be further discussed since ETL is the main technology for data consolidation.

2.4.2 Data Federation Data Federation [29] enables a unified virtual view of one or more data sources. When a query is issued against this virtual view, a data federation engine distributes the query across the data sources, then retrieves and integrates the resulting data according to the virtual view before returning the results. Data federation always pulls data from the source systems on demand, upon query invocation, and data transformations are performed after extracting the information from the data sources. Enterprise Information Integration (EII) is a technology that enables the federated approach.

Metadata is the key element in a federated system, being used by the federation engine to access the data sources. This metadata may have different levels of complexity. Simple metadata configurations may consist only of the definition of the virtual view, explaining how it maps onto the data sources. In more complex situations, it may describe the existing data load in the data sources and which access paths should be used (in this way, the federated solution may greatly optimize the access to the source data). Some federation engines may use metadata even further, describing additional business rules such as semantic relationships between data elements that cut across the source systems (e.g. customer data, where a common customer identifier may be mapped to the various customer keys used in other source systems).

The main advantage of the federated approach is that it provides access to data without the need to consolidate it into another data store, i.e. when the cost of data consolidation outweighs the business benefits it would provide. Data federation can be especially useful when data security policies and license restrictions prevent source data from being copied. However, data federation is not suited for dealing with large amounts of data, when significant quality problems may be present at the data sources, or when the performance impact and overhead of accessing multiple data sources at runtime become a performance bottleneck.
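The federation idea can be conveyed in miniature as follows: a "virtual view" over two hypothetical customer sources is answered at query time by pulling from both sources and merging the results, with no consolidated copy being kept. Real EII engines achieve this through metadata-driven query decomposition rather than hand-written code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy data-federation sketch: one virtual view, two underlying sources.
// The source contents are hard-coded stand-ins for remote systems.
public class FederationSketch {

    interface CustomerSource {
        List<Map<String, String>> findByCountry(String country);
    }

    /** The "federation engine": distributes the query and merges the results. */
    static class VirtualCustomerView {
        private final List<CustomerSource> sources;

        VirtualCustomerView(List<CustomerSource> sources) { this.sources = sources; }

        List<Map<String, String>> query(String country) {
            List<Map<String, String>> merged = new ArrayList<>();
            for (CustomerSource s : sources) {           // query each source on demand
                merged.addAll(s.findByCountry(country)); // integrate before returning
            }
            return merged;
        }
    }

    public static void main(String[] args) {
        CustomerSource crm = country -> List.of(Map.of("name", "John", "country", country));
        CustomerSource billing = country -> List.of(Map.of("name", "Maria", "country", country));

        VirtualCustomerView view = new VirtualCustomerView(List.of(crm, billing));
        System.out.println(view.query("PT"));            // data is pulled only when queried
    }
}
```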

2.4.3 Data Propagation The main focus of data propagation applications [29] is to copy data from one location to another. These applications usually operate online and push data to the target location following an event-driven approach. Updates from source to target systems may be performed either synchronously or asynchronously. While synchronous propagation requires that data updates occur within the same transaction, asynchronous propagation is independent from the update transaction at the data source. Regardless of the synchronization type, propagation guarantees the delivery of the data to the target system. Enterprise Application Integration (EAI) and Enterprise Data Replication (EDR) are two examples of technologies that support data propagation. The key advantage of data propagation is that it can be used for real-time / near-real-time data movement, as well as for workload balancing, backup and recovery. Data propagation tools vary considerably in terms of performance, data restructuring and cleansing capabilities. Some tools support the movement and restructuring of high volumes of data, whereas EAI products are often limited in these two features. This partially happens because enterprise data replication has a data-centric architecture, whereas EAI is message- or transaction-centric.
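Asynchronous, event-driven propagation can be pictured with a simple in-memory queue, as sketched below: source-side updates are published as change events and a separate consumer applies them to the target outside the source transaction. The queue and event type are invented for the illustration; real EAI / EDR products provide durable transport and delivery guarantees.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of asynchronous data propagation through an event queue.
public class AsyncPropagation {

    record ChangeEvent(String table, String key, String newValue) {}

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<ChangeEvent> channel = new LinkedBlockingQueue<>();

        // Consumer side: applies changes to the target system as they arrive.
        Thread applier = new Thread(() -> {
            try {
                while (true) {
                    ChangeEvent e = channel.take();
                    System.out.printf("apply to target: %s[%s] = %s%n",
                            e.table(), e.key(), e.newValue());
                }
            } catch (InterruptedException ignored) { }
        });
        applier.setDaemon(true);
        applier.start();

        // Source side: the update transaction only enqueues the event and commits,
        // so propagation does not block the source (asynchronous mode).
        channel.put(new ChangeEvent("CUSTOMER", "42", "new address"));
        Thread.sleep(200);   // give the consumer time to print before exiting
    }
}
```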

2.4.4 Hybrid Approach Data integration applications [29] may not be limited to a single data integration technique, but may instead use a hybrid approach that involves several integration techniques. Customer Data Integration (CDI) - where the objective is to provide a harmonized view of customer information - is a good example of this approach. A simple example of CDI is a consolidated customer data store that holds customer data captured from different data sources. The information entry latency in the consolidated database will depend on whether data is consolidated online or in batch. Another possible approach to CDI is the use of data federation, where a virtual customer view is defined over the data sources. This view could be used by external applications to access customer information, and the federated approach may use metadata to relate customer information based on a common key. A hybrid data consolidation and data federation approach is also possible: common customer data (e.g. name, address) could be consolidated into a single store, while the remaining, usually unique, customer fields (e.g. customer orders) could be federated. This scenario could be extended even further through data propagation: e.g. if a customer updates his or her name and address during a transaction, this change could be sent to the consolidated data store and then propagated to other source systems, such as a retail store customer database.



2.4.5 Change Data Capture Both data consolidation and data propagation create (to some extent) copies of the source data, requiring a way to identify and handle the data changes that occur in source systems. Two approaches are common for this purpose: rebuilding the target data store on a regular basis, keeping data minimally synchronized between source and target systems (which is impractical, except for small data stores), or employing some form of Change Data Capture (CDC) [29] capability.

If a timestamp recording the date of the last modification is available in the source data, it can be used to locate the data that has changed since the CDC application last executed. However, unless a new record or version of the data is created at each modification, the CDC application will only be able to identify the most recent change for each individual record, and not all the changes that might have occurred between application runs. If no timestamp is associated with the source data, then, in order to enable CDC, the data sources must be modified either to create a timestamp or to maintain a separate data file or message queue of data changes.

CDC can be implemented in various ways. In Relational Database Management Systems (RDBMS) a common approach is to add database update triggers that take a copy of the modified data, or to isolate data changes through the DBMS recovery log. Triggers may have a highly negative impact on the performance of source applications, since the trigger and the data update processing are usually performed within the same physical transaction, thus increasing the transaction latency. Processing the recovery log causes less impact, since this is usually an asynchronous task independent from the data update. In non-DBMS applications (e.g. document based), timestamping and versioning are quite common, which eases the CDC task. When a document is created or modified, the document metadata is usually updated to reflect the date and time of the event, and many unstructured data systems also create a new version of a document each time it is modified.
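A minimal sketch of the timestamp-based variant, with hypothetical table and column names: each run asks only for the rows changed since the previous run which, as noted above, yields the latest state of each record but not every intermediate change.

```java
import java.sql.*;

// Timestamp-based Change Data Capture sketch. Table and column names are
// hypothetical; 'lastRun' would normally be persisted between executions.
public class TimestampCdc {
    public static void main(String[] args) throws SQLException {
        Timestamp lastRun = Timestamp.valueOf("2007-06-01 00:00:00");

        try (Connection src = DriverManager.getConnection("jdbc:postgresql://src-host/ops");
             PreparedStatement ps = src.prepareStatement(
                     "SELECT id, payload, last_modified FROM SRC_EVENTS WHERE last_modified > ?")) {

            ps.setTimestamp(1, lastRun);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Only the most recent version of each changed row is visible here.
                    System.out.println("changed row: " + rs.getString("id"));
                }
            }
        }
        // After a successful run, the new high-water mark would be stored,
        // e.g. the maximum last_modified value seen in this batch.
    }
}
```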

2.4.6 Data Integration Technologies As previously introduced, several technologies are available for implementing the data integration techniques described above: Extract, Transform and Load (ETL), Enterprise Information Integration (EII), Enterprise Application Integration (EAI) and Enterprise Data Replication (EDR). A review of each technology follows.

2.4.6.1 Extract, Transform and Load (ETL) The ETL technology provides means of extracting data from source systems, transforming it accordingly and loading the results into a target data store. Databases and files are the most common inputs and outputs for this technology. ETL is the main supporting technology for data consolidation. Data can be gathered either using a schedule-based pull mode or based on event detection. When using the pull mode, data consolidation is performed in batch, while, if the push technique is applied, the propagation of data changes to the target data store is performed online. Depending on the input and output data formats, data transformation may require just a few or many steps: e.g. date formatting, arithmetic operations, record restructuring, data cleansing or content aggregation (a miniature example of such row-level transformations is sketched after the list below). Data loading may result in a complete refresh of the target store or may be performed gradually through multiple updates at the target destination. Common interfaces for data loading are Open DataBase Connectivity (ODBC), Java DataBase Connectivity (JDBC), Java Message Service (JMS), and native database and application interfaces. The first ETL solutions were limited to running batch jobs at pre-defined scheduled intervals, capturing data from file or database sources and consolidating it into a data warehouse (or relational staging area). Over the last years, a wide set of new features has been introduced, providing customization and extending the capabilities of ETL tools. Some significant examples follow:

o Multiple data sources (e.g. databases, text files, legacy data, application packages, XML files, web services, unstructured data);

o Multiple data targets (e.g. databases, text files, web services);

o Improved data transformation (e.g. data profiling and data quality management, standard programming languages, DBMS engine exploitation);

o Better management (e.g. job scheduling and tracking, metadata management, error recovery);

o Better performance (e.g. parallel processing, load balancing, caching);

o Better visual development interfaces.
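The sketch below is the promised miniature of such row-level processing, assuming a semicolon-separated input line with a day-first date and a raw integer value: the record is extracted from the line, the date reformatted, the value converted, and the restructured record handed to a (here trivial) load step.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Miniature of row-level ETL transformations: date formatting, type
// conversion and record restructuring. The input layout is hypothetical.
public class RowTransform {

    record Reading(String station, LocalDate day, double kiloUnits) {}

    static final DateTimeFormatter SOURCE_DATE = DateTimeFormatter.ofPattern("dd/MM/yyyy");

    /** Extract + transform one semicolon-separated source line. */
    static Reading transform(String line) {
        String[] fields = line.split(";");                         // extract
        LocalDate day = LocalDate.parse(fields[1], SOURCE_DATE);   // date formatting
        double kilo = Double.parseDouble(fields[2]) / 1000.0;      // arithmetic operation
        return new Reading(fields[0], day, kilo);                  // record restructuring
    }

    public static void main(String[] args) {
        // "Load" is reduced to printing; a real pipeline would batch-insert instead.
        System.out.println(transform("KIR;01/06/2007;12345"));
    }
}
```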

2.4.6.1.1 Tuning ETL

ETL is a traditional data integration technique widely used in information systems. However, for some specific cases, variations on standard ETL can increase performance drastically by taking advantage of RDBMS technology and special tuning features. Besides traditional ETL, two newer trends exist:

o ELT (Extraction, Loading and Transformation): Oracle and Sunopsis are the leaders of this technology, in which data is loaded into a staging area database and only then can transformations take place. The ELT technology has been constrained by database capabilities. Since ELT had its origins with RDBMS, the technology tended to be suitable for just one database platform. ELT also lacked functionality, as vendors were more concerned with building a database rather than an ELT tool. Sunopsis was the only exception of an ELT tool not owned by an RDBMS vendor… until it was acquired by Oracle;

o ETLT (Extraction, Transformation, Loading and Transformation): Informatica is the leader of this technology, which consists of a database pushdown optimization added to traditional ETL: a standard ETL process loads a target database, where further transformations are performed (for performance reasons) before moving the information into the target tables. Microsoft SQL Server Integration Services (SSIS) also has good ETLT capabilities with the SQL Server database.

Summarizing, while ELT presents itself as a novel approach that takes advantage of database optimization, ETLT can be considered a simple extension of ETL with some tuning functionality. Comparing ETL and ELT [30], the following advantages can be identified for ELT:

o ELT leverages the RDBMS engine for scalability;

o ELT keeps all data in the RDBMS all the time;

o ELT is parallelized according to the data set, and disk I/O is usually optimized at the engine level for faster throughput;

o ELT can achieve three to four times the throughput rates on the appropriately tuned RDBMS platform.

while the key negative points are [30]:

o ELT relies on proper database tuning and proper data model architecture;

o ELT can easily use 100% of the hardware resources available for complex and huge operations;

o ELT cannot balance the workload;

o ELT cannot reach out to alternate systems (all data must exist in the RDBMS before ELT operations take place);

o ELT easily increases disk storage requirements;

o ELT can take longer to design and implement;

o More steps (less complicated per step), but usually resulting in more SQL code.

Finally, the key advantages of ETL over ELT are [30]:

o ETL can balance the workload / share the workload with the RDBMS;

o ETL can perform more complex operations;

o ETL can scale with separate hardware;

o ETL can handle partitioning and parallelism independently of the data model, database layout, and source data model architecture;

o ETL can process data in-stream, as it transfers from source to target.

while the key negative points are [30]:

o ETL requires separate and equally powerful hardware in order to scale;

o ETL can bounce data to and from the target database and requires separate caching mechanisms, which sometimes do not scale to the magnitude of the data set.



2.4.6.2 Enterprise Information Integration (EII) Enterprise Information Integration [29] provides a virtual view of dispersed data, supporting the data federation approach to data integration. This view can be used for on-demand querying over transactional data, data warehouses and / or unstructured information. EII enables applications to see dispersed data sets as a single database, abstracting away the complexities of retrieving data from multiple sources, heterogeneous semantics and data formats, and disparate data interfaces. EII products have evolved from two different technological backgrounds (relational DBMS and XML), but the current trend of the industry is to support both approaches, via SQL (ODBC and JDBC) and XML (XML Query Language - XQuery - and XML Path Language - XPath) data interfaces. EII products with a strong DBMS background take advantage of the research performed on Distributed Database Management Systems (DDBMS), whose objective is to provide transparent, full read / write access to distributed data. A key issue in DDBMS is the performance impact of distributed processing on mission-critical applications (especially when supporting write access to distributed data). To overcome this problem, most EII products provide only read access to heterogeneous data, and just a few tools allow limited update capabilities. Another important performance feature is the ability of EII products to cache results and allow administrators to define rules that determine when the data in the cache is valid or needs to be refreshed.

2.4.6.3 EII versus ETL EII data federation cannot replace the traditional ETL data consolidation approach used for data warehousing, due to the performance and data consistency issues of a fully federated data warehouse. Instead, EII should be used to extend data warehousing to address specific needs. Complex query processing that requires access to operational transaction systems may affect the performance of the operational applications running on those systems; EII improves performance in these situations by sending simpler and more specific queries to the operational systems. A potential problem with EII arises when transforming data from multiple source systems, since data relationships may be complex / confusing and the data quality may be poor, preventing good federated access. These issues point out the need for a more rigorous approach to system modelling and analysis for EII. A set of circumstances in which EII may be a more appropriate alternative for data integration than ETL follows [29]:

o Direct write access to the source data: Updating a consolidated copy of the source data is generally not advisable due to data integrity issues. Some EII products enable this type of data update;

o It is difficult to consolidate the original source data: For widely heterogeneous data and content, it may be impossible to bring all the structured and unstructured data together in a single consolidated data store;

o Federated queries cost less than data consolidation: The cost and performance impact of using federated queries should be compared with the network, storage and maintenance costs of using ETL to consolidate data in a single store. When the source data volumes are too large to justify consolidation, or when only a small percentage of the consolidated data is ever used, a federated solution is more appropriate.

The arguments in favour of ETL compared to EII are [29]:

o Read-only access to reasonably stable data is required: Creating regular snapshots of the source data isolates users from the ongoing changes to that data, defining a stable data set that can be used for analysis and reporting;

o Users need historical or trend data: Operational data sources may not keep a complete history available at all times (e.g. sliding window approaches). This history can be built up over time through the ETL data consolidation process;

o Data access performance and availability are key requirements: Users want fast access to local data for complex query processing and analysis;

o User needs are repeatable and can be predicted in advance: When most of the performed queries are well defined, repeated in time, and require access to only a known subset of the source data, it makes sense to create a copy of the data in a consolidated data store for its manipulation;

o Data transformation is complex: Due to performance issues, it is inadvisable to perform complex data transformations as part of an EII federated query.

2.4.6.4 Enterprise Application Integration (EAI) Enterprise Application Integration [29] provides a set of standard interfaces that allow application systems to communicate and exchange business transactions, messages and data, accessing data transparently and abstracting from its location and format. EAI supports the data propagation approach to data integration and is usually used for real-time operational transaction processing. Access to application sources can be performed through several technologies, such as web services, Microsoft .NET interfaces or JMS. EAI was designed for propagating small amounts of data between applications (not supporting the complex data structures handled by ETL products), either synchronously or asynchronously, within the scope of a single business transaction. If asynchronous propagation is used, business transactions may be broken into multiple lower-level transactions (e.g. a travel request could be broken down into airline, hotel and car reservations, although in a coordinated way).



2.4.6.5 EAI versus ETL EAI and ETL are not competing technologies, and in many situations they are used together to complement one another [29]: EAI can be a data source for ETL, and ETL can be a service to EAI. The main objective of EAI is to provide transparent access to a wide set of applications. Therefore, an EAI-to-ETL interface could be used to give ETL access to application data, e.g. through web service communication. Using such an interface, custom point-to-point adapters for these data source applications would not need to be developed for ETL purposes. In the opposite architectural configuration, the interface could also be used as a data target by an ETL application. Currently, most of these interfaces are still in their early stages of development, and in many cases, instead of an EAI-to-ETL interface, organizations use EAI to create data files, which are then fed into the ETL application.
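The EAI-to-ETL direction mentioned above can be pictured as an ETL job that pulls application data through a web-service endpoint instead of a point-to-point adapter; the URL and the handling of the payload below are placeholders for the illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of an EAI-to-ETL interface: the ETL side fetches application data
// exposed as a web service. The endpoint URL is hypothetical.
public class WebServiceSource {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://eai-gateway.example/orders?since=2007-06-01"))
                .header("Accept", "application/xml")
                .build();

        // The response body would be parsed and passed on to the usual
        // transformation and loading steps of the ETL pipeline.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("fetched " + response.body().length() + " bytes from the EAI layer");
    }
}
```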

2.4.6.6 Enterprise Data Replication (EDR) The Enterprise Data Replication technology [29] is less known than ETL, EII or EAI, even though it is widely used in data integration projects. This lack of visibility happens because EDR is often packaged together with other solutions (e.g. all major DBMS vendors include data replication capabilities in their products, as do many CDC-based solutions that also offer data replication facilities). EDR is not limited to data integration purposes, being also used for backup and recovery, data mirroring and workload balancing. Some EDR products support two-way synchronous data propagation between multiple databases. Also, online data transformation is a common property of EDR tools when data is flowing between two databases. The major difference between the EDR and EAI approaches is that EDR data replication is used for transferring considerable amounts of data between databases, while EAI is designed for moving messages and transactions between applications. A hybrid approach combining a data replication tool and an ETL tool is very common: e.g. EDR can be used to continuously capture and transfer large data sets into a staging area, and on a regular basis this data is extracted from the staging area (by a batch tool that consolidates the data) into a data warehouse infrastructure.

2.5 Metadata for Describing ETL Statements The metadata driven ETL design is the most common data processing architecture among ETL tools. The key feature in this design is the metadata definition, which provides a language for expressing ETL statements and, when instantiated, dictates which operations shall be performed by the ETL engine. The structure and semantics of this language do not follow any specific standard, since none has been commonly accepted so far for describing ETL tasks. Depending on the type of ETL tool (research, open-source or commercial), one of four approaches (or a combination of them) is usually followed for the representation of ETL metadata:

1. Proprietary metadata: A private metadata definition (e.g. binary) that is not made available to the public domain. Usually used in the commercial ETL domain;

2. Specific XML-based language: Defines the structure and semantics through an XML Schema or Document Type Definition (DTD). The language is defined according to the specific functionalities of the ETL engine. Usually, each research and open-source ETL tool follows its own specific language, completely different from the others in terms of both structure and semantics;

3. XML-based Activity Definition Language (XADL) [31, 32]: An XML language for data warehouse processes, built on a well-defined DTD;

4. Simple Activity Definition Language (SADL) [31, 32]: A declarative definition language motivated by the SQL paradigm.

Since the first approach cannot be discussed, given that it is unknown, and the second one can assume almost any structure and specification, these will not be further addressed. Although XADL and SADL have not been defined specifically for supporting ETL assertions, they will be explained using the following example set of data transformations:

1. Push data from table LINEITEM of source database S to table LINEITEM of the Data Warehouse (DW) database;

2. Perform a referential integrity violation check for the foreign key of table LINEITEM in database DW that references table ORDER. Delete any violating rows;

3. Perform a primary key violation check on table LINEITEM. Report violating rows to a file.

Figure 2.6 depicts a subset of the XADL definition for this scenario.

Figure 2.6: Part of the scenario expressed with XADL

In lines 3–10 the connection instructions are given for the source database (the data warehouse database is described similarly). Line 4 describes the Uniform Resource Locator (URL) of the source database. Line 8 presents the class name of the employed JDBC driver, which communicates with an instance of a DBMS through the DBMSFin driver. Lines 67–102 describe the second activity of the scenario. First, in lines 68–85 the structure of the input table is given. Lines 86–92 describe the error type (i.e., the functionality) of the activity: they declare that all rows that violate the foreign key constraint should be deleted. The target column and table are specifically described. Lines 93–95 deal with the policy followed for the identified records and declare that, in this case, they should be deleted. A quality factor returning the absolute number of violating rows is described in lines 96–98. This quality factor is characterized by the SQL query of line 97, which computes its value, and by the report file where this value should be stored.
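To give a concrete flavour of this notation, the fragment below is a purely illustrative sketch in the spirit of XADL, covering a connection and a foreign key checking activity. The element and attribute names are assumptions made for exposition only and do not reproduce the actual DTD defined in [31, 32].

<connection id="S">
  <url>jdbc:dbms://source-host/source_db</url>          <!-- hypothetical URL of the source database -->
  <driver>com.vendor.jdbc.Driver</driver>               <!-- hypothetical JDBC driver class name -->
</connection>
<activity id="check_fk" type="FOREIGN_KEY_VIOLATION">
  <input table="LINEITEM" database="DW"/>               <!-- table under verification -->
  <references table="ORDER"/>                           <!-- referenced table -->
  <policy action="DELETE"/>                             <!-- violating rows are deleted -->
  <qualityFactor report="fk_violations.log">
    SELECT COUNT(*) FROM LINEITEM WHERE ...              <!-- query computing the number of violating rows (elided) -->
  </qualityFactor>
</activity>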

The SADL language is composed of four definition statements (Figure 2.7):
o CREATE SCENARIO: Specifies the details of a scenario (ties all other statements together);
o CREATE CONNECTION: Specifies the details of each database connection;
o CREATE ACTIVITY: Specifies an activity;
o CREATE QUALITY FACTOR: Specifies a quality factor for a particular activity.

Figure 2.7 presents the syntax for the main statements.

Figure 2.7: The syntax of SADL
Returning to the scenario example, Figure 2.8 presents a representation using SADL.

Figure 2.8: Part of the scenario expressed with SADL

Lines 1–4 define the scenario, which consists of three activities. The connection characteristics for connecting to the data warehouse are declared in lines 6–9. An example of the SADL description of an activity can be seen in lines 11–16 for the reference violation checking activity. Finally, lines 18–22 express the declaration of a quality factor, which counts the number of rows that do not pass the foreign key violation check. The quality factor is traced into a log file. Compared to SADL, XADL is rather verbose and complex to write. Yet, it is more comprehensible, since it is quite detailed in its appearance and produces programs that are easily understandable even by a non-expert. SADL, in turn, is more compact and resembles SQL, making it more suitable for a trained designer.

2.6 ETL Market Analysis
The ETL market comprises multiple tools for the design and population of data warehouses, data marts and operational data stores. Most of these tools enable the periodical extraction, transformation and integration of data from any number of heterogeneous data sources (frequently transaction databases) into time-based databases used predominantly for query and reporting purposes. It is usual for these tools to provide developers with an interface for designing source-to-target mappings, transformations and for handling metadata. This section presents two independent surveys conducted in April 2004 and May 2005 by the METAspectrum [33] and Gartner [34] market analysis enterprises, respectively. The surveys only refer to commercial ETL applications; no research or open source ETL applications have been analysed in them. Since the information held in these reports is proprietary (1700€ per report copy), the full contents are not available to the public and no information regarding a 2006 ETL survey could be found. The presentation of the market surveys follows a chronological order, starting with an explanation of the analysis criteria followed by an overview of the survey findings.

2.6.1 METAspectrum Market Summary
The market survey performed by METAspectrum [33] in April 2004 followed a set of seven criteria for evaluating ETL tools 4:
o Platform Support: Support for the enterprise's existing sources, targets and execution environments is fundamental. Increasingly, support for non-DBMS sources (e.g. web services, log files) is also becoming a critical concern;
o Transformations: Developers require both a broad palette of selectable data transformations and flexibility in developing and incorporating new logic;
o Data Management Utilities: Going outside the tool for high-performance sorting, job scheduling and data transport can be a nuisance and a maintenance headache;
o Performance Characteristics: Data has to be integrated faster and batch windows are progressively shrinking;
o Developer Environment Features: Graphical User Interface (GUI) features and flexibility, multi-developer capabilities, code debugging and application versioning are useful capabilities;
o Metadata: Support for metadata sources and interchange standards (e.g. CWM, XMI);
o Viability: Even near-term metadata standards do not provide for the porting of ETL applications. Enterprises should be concerned with long-term support.
4 Only a short description of the evaluation criteria and an overall evaluation of the ETL products has been made public for this report. Evaluation values for each criterion have not been disclosed.

Figure 2.9 depicts a graphical representation of the evaluation of ETL tools. The graphical representation contains two axes (tool performance versus market presence). The market applications have been grouped into three clusters: follower, challenger and leader applications.

Figure 2.9: METAspectrum evaluation [33]
Market leaders have stable, mature products with a broad array of data sourcing and targeting options, seamless access to mainframe data, robust developer environments and job parallelization. They have also leveraged strong financial cushions, enabling them to innovate and acquire ancillary capabilities (e.g. data quality). Many challengers in the ETL market offer built-in intelligence features that help speed the mapping of data sources, tuning of jobs or handling of errors during runtime. Others offer enormous libraries of built-in transformations or conjoined BI capabilities.


Followers in this market are those that have an existing installed base for service and maintenance fees as they search for an acquirer or new home for their technology. Some have chosen to specialize in unpopular or vertical industry data sources. Others are nudging their way into the ETL marketplace with alternate data integration paradigms.

2.6.2 Gartner Market Summary
The market survey performed by Gartner [34] in May 2005 followed a set of eleven criteria for evaluating ETL tools 5:
o Ease of Deployment and Use: Implementation, configuration, design and development productivity;
o Breadth of Data Source and Target Support: Connectivity to a range of database types, applications and other infrastructure components;
o Richness of Transformation and Integration Capabilities: Support for a variety of transformation types and ability to handle complexity in transforming and merging data from multiple sources;
o Performance and Scalability: Ability to process large data volumes and support the needs of large enterprises in a timely and cost-effective manner;
o Metadata: Discovery, audit, lineage, impact analysis, interoperability with other tools;
o Vendor Viability and Overall Execution: Vendor focus, financials, innovation, partnerships, pricing, support capabilities and breadth of customer references;
o Data Quality Functionality: Data quality analysis, matching, standardization, cleansing and monitoring of data quality;
o Service Orientation: Ability to deploy data integration functionality as a service and consume other services as a source of data;
o Real-time and Event-driven Capabilities: Support for real-time data sources such as message queues, low-latency data delivery, CDC;
o Portability: Seamless deployment across multiple platforms, distributed and mainframe;
o Breadth of Vision: Degree to which the vendor acknowledges and supports data integration patterns beyond traditional ETL for Business Intelligence (BI) and data warehousing.

The Magic Quadrant graphic (Figure 2.10) is based on two axes (ability to execute versus completeness of vision). It divides vendors into four brackets: leaders (big on vision and execution), challengers (big on execution, less big on vision), visionaries (big on vision, not as good at execution) and niche players (short on both vision and execution).
5 Similar to the METAspectrum survey, individual evaluation values for each criterion have not been disclosed.

Figure 2.10: Magic quadrant [34] for extraction, transformation and loading

2.7 ETL – State of the Art Report
The first step in selecting an ETL tool consists in determining the current state of the art. Unfortunately, existing ETL surveys are affected by one or more of the following drawbacks:

o Incomplete: ETL surveys such as [35] only refer to commercial tools. Research and open-source initiatives are not taken into consideration in these surveys;
o Non-extensive: Only a limited number of surveys exist that correlate more than one ETL tool. For these, however, the number of ETL tools is rather limited and usually covers only the top three or four market leaders;
o Biased: Multiple evaluations of ETL tools exist that are sponsored by individual ETL vendors or consortiums of vendors. These evaluations (usually white papers) are rather biased towards the sponsoring vendors' software and cannot be considered reliable;
o Expensive: The ETL Survey for 2006-2007 performed by ETL Tools [35], an independent company within the ETL domain, costs around 1700€ and is not open to the public.


In order to provide an independent and public solution for the previously mentioned drawbacks (at least to some extent 6), the author has conducted a report describing the state of the art for the ETL domain [22]. Due to its length (around 200 pages), the complete document was not directly included here; only the information regarding the ETL background, data integration approaches and technologies has been incorporated, as part of the current chapter. All specific information regarding the ETL applications that were analysed is available only in the report. The ETL – State of the Art report [22] covers the following theoretical / business aspects of ETL:
o ETL conceptual representation and framework;
o Classical data integration architectures, comprising (i) Hand Coding, (ii) Code Generators, (iii) Database Embedded ETL and (iv) Metadata Driven ETL Engines;
o Approaches to data processing, comprising (i) Data Consolidation, (ii) Data Federation, (iii) Data Propagation, (iv) Change Data Capture and (v) Hybrid Approaches;
o Data Integration Technologies, comprising (i) ETL, ELT and ETLT, (ii) Enterprise Information Integration (EII), (iii) Enterprise Application Integration (EAI) and (iv) Enterprise Data Replication (EDR);
o Metadata languages for describing ETL statements.

Further, an individual analysis of ETL applications has been performed according to three domains: research / academic, open-source and commercial. The following list presents all the applications described in the report, grouped by their development domain:
o Research ETL Tools: AJAX, ARKTOS, Clio, DATAMOLD, IBHIS, IBIS, InFuse, INTELLICLEAN, NoDoSe and Potter's Wheel;
o Open Source ETL Tools: Enhydra Octopus, Jitterbit, KETL, Pentaho Data Integration: Kettle Project, Pequel ETL Engine and Talend Open Studio;
o Commercial ETL Tools: Business Objects Data Integrator, Cognos DecisionStream, DataMirror Transformation Server, DB Software Laboratory's Visual Importer Pro, DENODO, Embarcadero Technologies DT/Studio, ETI Solution v5, ETL Solutions Transformation Manager, Group1 Data Flow, Hummingbird Genio, IBM Websphere Datastage, Informatica PowerCenter, iWay Data Migrator, Microsoft SQL Server 2005, Oracle Warehouse Builder, Pervasive Data Integrator, SAS ETL, Stylus Studio, Sunopsis Data Conductor, Sybase TransformOnDemand.
6 Some of the applications could not be properly evaluated due to the shortage of technical information and / or the unavailability of the software for public usage.


2.8 Space Environment Information System for Mission Control Purposes

The space environment and its effects are progressively being taken into account in spacecraft manufacture, from the earliest design phases until the operational state is reached. Most of these space effects on the spacecraft (long-term or temporary effects) can be predicted with acceptable accuracy thanks to space environment models and tools developed for this purpose at design time. This is the case of the Space Environment Information System (SPENVIS) [36], which provides access to several models and tools to produce a space environment specification for any space mission (e.g. particle fluxes, atomic oxygen degradation). On the other hand, during the operational phase of the spacecraft, some anomalies due to space environment effects or to unpredictable space weather events can occur and affect spacecraft behaviour. These anomalies originate mainly from solar activity (e.g. Solar Proton Events, Coronal Mass Ejections), which is not evaluated during system design with the same level of accuracy as the effects produced by the well-known space environment. Solar events and their effects are difficult to predict, and spacecraft anomalies caused by the space environment are not always attributed to it, due to the lack of proper operational tools able to integrate and correlate space environment information and spacecraft housekeeping data simultaneously. Scientific and navigation spacecraft orbiting in Medium Earth Orbit (MEO) are a good example of how space environment models may not be as realistic as for other orbits, due to the high variability of the environment at this altitude. The continuous operation of the payload on these spacecraft is a critical issue due to the nature of the supplied products. Access to space environment and space weather databases, together with a deep knowledge of the space environment design (i.e. spacecraft shielding information, radiation testing data, housekeeping telemetry designed to monitor the behaviour of the spacecraft systems versus the space environment effects), will help to increase the lifetime of spacecraft missions and improve the construction of the next generation of spacecraft. Such data integration systems require both real-time and historical data from multiple sources (possibly with different formats) that must be correlated by a space-domain expert, through visual inspection, using monitoring and reporting tools. This is the main principle of the SEIS system 7, developed for the European Space Agency (ESA) [37] by UNINOVA [38] as prime contractor and DEIMOS Engenharia [39] as subcontractor. The author participated in the system's implementation as a UNINOVA team member and was responsible for the partial definition of metadata ETL scripts for processing the input data files found relevant in the system's scope.

7 This is also the main principle for the SESS system.


2.8.1 Objectives
SEIS is a multi-mission decision support system, capable of providing near real-time monitoring [8] and visualization (in addition to offline historical analysis [9]) of space weather and spacecraft data, events and alarms to the FCTs responsible for the International Gamma-Ray Astrophysics Laboratory (Integral), Environmental Satellite (Envisat) and X-Ray Multi-Mission (XMM) satellites. Since the Integral spacecraft has been selected as the reference mission, all SEIS services – offline and online – are available for it, while the Envisat and XMM teams only benefit from a fraction of the services available for the Integral mission. The following list outlines SEIS's core services:
o Reliable Space Weather and Spacecraft data integration;
o Inclusion of Space Weather and Space Weather effects estimations generated by a widely accepted collection of physical Space Weather models;
o Near real-time alarm-triggered events, based on rules extracted from the Flight Operations' Plan (FOP), which capture users' domain knowledge;
o Near real-time visualization of ongoing Space Weather and Spacecraft conditions through the SEIS Monitoring Tool (MT) [8];
o Historical data visualization and correlation analysis (including automatic report design, generation and browsing) using state-of-the-art On-Line Analytical Processing (OLAP) client/server technology – the SEIS Reporting and Analysis Tool (RAT) [9].

2.8.2 Architecture
In order to provide users with the previously mentioned set of services, the system architecture depicted in Figure 2.11 was envisaged, which is divided into several modules according to their specific roles:
o Data Processing Module: Responsible for file retrieval, parameter extraction and further transformations applied to all identified data, ensuring it meets the online and offline availability constraints, whilst keeping reusability and maintainability issues in mind (further detailed in the following section);
o Data Integration Module (DIM): Acts as the system's supporting infrastructure database, providing integrated data services to the SEIS client applications, using three multi-purpose databases: Data Warehouse (DW) [27], Operational Data Storage (ODS) and Data Mart (DM);
o Forecasting Module (3M): A collection of forecast and estimation models capable of generating Space Environment [40] and Spacecraft data estimations. Interaction with any of these models is accomplished using remote Web Services invocation, which relies on XML message-passing mechanisms;
o Metadata Module (MR): SEIS is a metadata driven system, comprising a central metadata repository [41] that provides all SEIS applications with means of accessing shared information and configuration files;
o Client Tools: The SEIS system comprises two client tools, which take advantage of both the collected real-time and historical data – the SEIS Monitoring Tool and the SEIS Reporting and Analysis Tool, respectively.

Figure 2.11: SEIS system architecture modular breakdown

2.8.3 Data Processing Module
The Data Processing Module integrates three components – UDAP, UDET and UDOB – that act as a pipeline for data processing. The functionality of each component is described next.

2.8.3.1 UDAP
The Uniform Data Access Proxy (UDAP) [42] is the primary interface with the external data distributing services, acting as the starting point of the data processing chain composed of UDAP, UDET and UDOB. UDAP is responsible for retrieving all input files from the different data service providers' locations via the HTTP and FTP protocols, being able to cope with remote service availability failures and performing recovery actions whenever possible. This component is also responsible for preparing, invoking (through a web service layer) and processing the space weather and spacecraft model data outputs generated by the estimation and forecasting 3M application. All the retrieved data is afterwards stored in a local file cache for backup purposes. Once stored, files are immediately sent for processing to UDET. UDAP also supports the addition, removal and update of its metadata in real time, while the application is running. For maintainability and reusability purposes, metadata definitions are stored in a centralized Metadata Repository.

Although UDAP can be considered an engine for input file download, it has been integrated into a graphical application that enables the user to control file download at data service provider and input file level. Further, a graphical component is available that enables the visualization of all download actions performed by UDAP as well as the responses of UDET to data processing requests. Besides visualization, this component also enables filtering and querying of logging data. The application has been implemented using Microsoft .NET [43] and Internet Information Services (IIS) [44] technologies.

2.8.3.2 UDET
The Unified Data Extractor and Transformer (UDET) [42] is the second component in the chain of the Data Processing pipeline. The main goal of UDET is data processing, which includes performing extraction and transformation activities according to user declarative definitions – File Format Definitions (FFDs) – for the online and offline data files received from UDAP. After processing, the results are sent to the respective offline or near real-time UDOB (Uniform Data Output Buffer) instance. The application has been implemented using Microsoft .NET and IIS technologies and can be executed in one of two ways:
o Web Service: Provides a transparent data processing mechanism, capable of accepting data processing requests and delivering processed data into the respective target UDOB. Since the processing tasks are mainly processor intensive, the deployment scenario should comprise at least two UDET-UDOB instances, one for processing and delivering near real-time data and the other for offline data processing;
o Portable library: The extraction and transformation logic has been gathered in a common package that can be used by external applications. The FFD Editor, capable of creating, editing and testing FFDs given an example input file, would be the main user application for this library. However, due to time constraints, this application has not been developed in the scope of SEIS.

2.8.3.3 UDOB
The Uniform Data Output Buffer (UDOB) [42] constitutes the endpoint component of the Data Processing Module, also known as the Staging Area. The primary role of UDOB is to behave as an intermediate data buffer in which the same data is made available to the ODS, the DW or any other data retrieving client. UDOB has been implemented using Microsoft .NET [43], IIS [44] and SQL Server [45] technologies.

2.8.4 Evaluation
Although the SEIS system (and all its underlying components) has been enormously successful in practice, it was a prototype system and some restrictions and simplifications have been imposed in order to reach a functional implementation within the project's schedule. Thus, the SEIS Data Processing Module presented some shortcomings:

o The SEIS ETL solution was not independent of the operating system. The SEIS DPM architecture was based on the proprietary Microsoft .NET and IIS technologies, which made the usage of the MS Windows operating system mandatory;
o Although the usage of a declarative language suppressed the need for source code development, the SEIS DPM did not follow a clear separation of concerns between domain and computer-science expertise right from the project start. This resulted in a somewhat tangled solution;
o In SEIS, all FFDs were created without any graphical support besides an XML editor, which required extensive XML knowledge from the domain user during the FFD definition task;
o UDOB is too tightly coupled to the target data delivery database, implemented with Microsoft SQL Server. A generic interface should be available, abstracting any specific reference to the target database / application, thus promoting the possible reuse of the DPM package in other problems / domains;
o UDOB is not a feasible approach when dealing with large sets of data, where a relational staging area may become a performance bottleneck. In SEIS, when processing a massive set of historical telemetry data that would be inserted directly in the Data Warehouse, UDOB was found to be a major bottleneck. In this case a file-based approach would be more suitable than the relational scheme of UDOB. At the time, a new data processing pipeline had to be developed, removing UDOB and replacing it by a file-based output component. Processing time improved from several months to several days;
o Data quality mechanisms (such as data typing and validation rules) were missing from the SEIS declarative language. In case of a change in the provided file format, invalid data could be loaded into UDOB without raising any error at the UDET level;
o The SEIS supporting language was not extensible in terms of defining new transformations (a common change, directly dependent on the way the file is formatted). If a new transformation was found to be required, a direct change to the DPM's core source code had to be performed;
o The scalability and performance of the DPM components must be improved for dealing with large volumes of textual data. The DPM solution was found to scale only for small text files (below 200 KB). Above this threshold, performance started to degrade exponentially;
o Engine functionalities (i.e. retrieval and data processing) are not isolated from the presentation level (GUI). In UDAP, both layers were merged together in the same application, requiring additional computational resources;
o In case of failure during a data processing task, the SEIS DPM followed a passive approach, only registering the occurrence in a log file.



2.9 Conclusions
Based on the developed report describing the current state of the art for the ETL domain, some conclusions can be derived:
o Academic software prototypes are quite scarce (not to say non-existent) and information is mainly available in scientific papers and journals. In most cases, the presented work does not refer to a complete ETL software solution but focuses on particular features of ETL, mainly related to automatic learning;
o Open source software can be used freely and in some cases presents a suitable set of ETL tools and capabilities (although still far from the capabilities of commercial ETL solutions). Special care must be taken regarding the stability of the application suite's development as well as the community support for the integration tool;
o Commercial application suites are very complete, not only in terms of ETL capabilities but also regarding complementary tools (e.g. data profiling, grid computing) that are also property of the ETL tool vendor. Depending on whether the ETL tool vendor is simultaneously a DBMS vendor or not, different approaches to ETL may be followed (e.g. ELT, ETLT). However, most commercial solutions can be generalized to a metadata-based architecture where metadata is seamlessly generated by graphical client tools, resides in a centralized metadata repository and is interpreted and executed by some kind of engine;
o Most of the open source and commercial ETL tools that were analysed (independently of using a metadata-driven or RDBMS engine) follow a quite similar architecture for their main components. Figure 2.12 presents an architectural abstraction for a generic ETL tool, factorizing common tools, functionalities and interactions, based on the conclusions of the ETL report conducted by the author [22].

Figure 2.12: An abstract ETL architecture
A Metadata Repository appears at the centre of the architecture, supporting the entire ETL suite in terms of business and operational metadata. Metadata is generated seamlessly by the user through a Designer application that enables the definition of mappings between source and target schemas and the design of workflows for the transformation pipeline. Depending on the availability of supporting applications, External Features (e.g. data profiling, impact analysis) can be used to understand the source data, extrapolate data quality measures or visualize the operational impact of changes over the defined ETL process. All generated metadata is placed in the Metadata Repository and is then interpreted at runtime by an Engine that executes the operations expressed in the metadata. Data processing can be performed automatically by specifying a set of execution schedules through the Scheduler application. The overall execution of the ETL pipeline can be controlled (e.g. execution start / stop, grid computing) and monitored (e.g. visual logging) through a set of Management Tools. The following architectural examples for commercial ETL tools show how the abstract architecture previously presented is instantiated: Group1 Data Flow (Figure 2.13), Sybase Transform On Demand (Figure 2.14) and Sunopsis (Figure 2.15).

Figure 2.13: Group1 Data Flow architecture [46-48]

Figure 2.14: Sybase Transform On Demand architecture [49-51]


Figure 2.15: Sunopsis architecture [52-55]
Besides being supported by a common architecture, many of the tools also share common features, namely:
o Partial Transparency: Although most technical aspects of the ETL language are transparent to the user, technical expertise is still required (allied to domain expertise);
o Approaches to Data Integration: Data Consolidation and Data Propagation are supported by almost all commercial applications, while Data Federation and CDC are less supported since they are less required in practice;
o Data Sources: Relational databases (to a greater or lesser extent) and standard sources (e.g. XML, text files, ODBC, JDBC) are two types of source / target shared by almost all ETL tools;
o Data Quality: Data quality facilities are present in most ETL tools, usually through the definition of transformation pipelines. In some tools, visual support through graphics (e.g. scatter plots) is available, enabling a fast identification of data outliers and dirty data. Finally, some management tools also support data quality by promptly advising the system administrator whenever faulty data (not corresponding to the performed specification) is detected (e.g. email and SMS messaging are common);
o Mapping Tool: Enables the direct mapping of a source schema to a target schema (possibly with a simple transformation step in between, e.g. conversion from lowercase to uppercase). All mapping information is represented and stored as metadata. Two examples of mapping tools are presented in Figure 2.16 and Figure 2.17 for the DB Software Laboratory's Visual Importer and iWay Data Migrator applications, respectively;



Figure 2.16: DB Software Laboratory’s Visual Importer Mapping [56, 57]

Figure 2.17: iWay Data Migrator [58-61]
o Workflow Tool: Enables the creation of transformation pipelines, either using a set of predefined functions made available by the ETL tool or by invoking external custom functions. All the pipeline logic is represented and stored as metadata. Two examples of workflow tools are presented in Figure 2.18 and Figure 2.19 for the Informatica and SAS ETL Studio applications;

Figure 2.18: Informatica [28, 62] workflow example

Figure 2.19: SAS ETL Studio [63-65] workflow example

o Supporting Features: Besides the traditional mapping, workflow and management tools, ETL vendors also provide multiple supporting features, either included in the ETL suite package or bought individually as a framework extension. Classical examples of such supporting features are data mining tools (e.g. Figure 2.20), multi-user collaboration tools supported by a Concurrent Versions System (CVS) (e.g. Figure 2.21) and impact analysis 8 tools (e.g. Figure 2.22);

Figure 2.20: Business Objects Data Integration (data patterns detection) [66]

Figure 2.21: Business Objects Data Integration (multi-user collaboration) [66]

8 Impact analysis highlights the consequences of a change to an ETL component in the ETL process (e.g. a change in a data source structure may cause an error in the extraction phase).


Figure 2.22: Business Objects Data Integration (impact analysis) [66]
o Scheduling: Schedulers are quite common among ETL tools, enabling the automatic execution of ETL operations without user assistance, given a set of predefined execution schedules. Besides the scheduler application provided by the ETL vendor, in some cases interfaces for third-party scheduling programs are also available. Some graphical examples of Scheduler applications are provided in Figure 2.23 and Figure 2.24 for the DB Software Laboratory's Visual Importer and Sybase TransformOnDemand applications, respectively;

Figure 2.23: DB Software Laboratory’s Visual Importer (scheduler) [56, 57]



Figure 2.24: Sybase TransformOnDemand (scheduler) [49-51]
o Management Tools: Every ETL tool contains a management console where the administrator can control the overall functioning of the ETL pipeline (usually through some kind of graphical log), issue start / stop commands and, in some cases, generate / edit specific metadata. Depending on the complexity and power of the management tool, other non-standard administration features can also be present, e.g. distributed processing management (Figure 2.25). Management applications can be desktop-based, web-based or both;

Figure 2.25: Informatica (management grid console) [28, 62]
o Semi-structured data: Semi-structured textual data is considered a secondary data source and the set of supported file formats is quite restricted (e.g. fixed width, Comma Separated Values - CSV). Semi-structured data is defined through wizards with a limited set of operators for user interaction, as depicted in Figure 2.26, Figure 2.27 and Figure 2.28 for the DB Software Laboratory's Visual Importer Pro, Sybase TransformOnDemand and SAS ETL applications, respectively;



Figure 2.26: DB Software Laboratory’s Visual Importer (text wizard) [56, 57]

Figure 2.27: Sybase TransformOnDemand (text data provider wizard) [49-51]



Figure 2.28: SAS ETL (text wizard) [63-65]

o Metadata Repository: ETL tools are supported, to some extent, by metadata that is placed in a central Metadata Repository. These repositories are supported by an RDBMS and different databases may be used for implementing the repository. Although XML-based metadata is used by some of the tools, in the majority of cases it is injected into an RDBMS instead of a native XML database (e.g. eXist [67]). Most of the repositories are quite monolithic in their functionalities, being specifically targeted at ETL domain metadata and not providing a general metadata solution. Data interchange capabilities between Metadata Repositories are not usually available, nor can metadata from one ETL vendor be used in another ETL vendor's Metadata Repository.


Chapter 3
Decomposing ETL: The ETD + IL Approach
Focusing on a novel approach to ETL, this chapter proposes a clear separation of domain from technological concerns, such that ETL = ETD + IL. First, the classical ETL approach is described, analysed and evaluated in the scope of semi-structured scientific data. Then the ETD+IL approach is explained, describing specifically which actions belong to ETD and which to IL. Finally, a set of requirements is derived for accomplishing a complete data retrieval and processing solution.


A classical ETL system extracts data from multiple data sources, enforces data quality and consistency standards through data transformation and delivers data in a pre-established format. Such an approach is not the most appropriate, especially in the context of retrieving data from the WWW (due to the huge quantity of text files that follow heterogeneous format / presentation rules). Since the data present in the text files is closely related to the domain it refers to, it is fundamental to involve a domain expert (usually without computer-science skills) in the selection, extraction and preparation / transformation of the relevant data present in the text. In order to provide a clear separation of concerns [68, 69], this chapter presents a different approach to ETL. A division of the well-known ETL paradigm is proposed, separating domain ETD operations (Extraction, Transformation and Data Delivery), which require domain expertise, from technical IL operations (Integration and Loading), which require computer-science expertise, such that ETL = ETD + IL. The ETD data processing solution has been devised following a set of thirteen guidelines (general requirements): (i) Free, Open Source and Independent, (ii) Completeness, (iii) Separation of Concerns, (iv) User Friendliness, (v) Performance, (vi) Scalability, (vii) Modularity, (viii) Reusability, (ix) Metadata Driven, (x) Correctness, (xi) Validation, (xii) Data Traceability and (xiii) Fault Tolerance.

3.1 Classical ETL solutions
A common ETL system extracts data from one or more source systems, enforces data quality and consistency standards through data transformation and finally delivers data in a pre-established format, either to a staging area or for direct display in a graphical application (Figure 3.1).

Figure 3.1: Abstract architecture of a data warehouse
Although the construction of an ETL system is usually considered a back room activity that is not visible to end users, it easily consumes 70 percent of the resources needed for the implementation and maintenance of a typical data integration project [12]. ETL is both a simple and a complicated subject. Almost everyone understands the basic mission of the ETL system, but it can easily be split into a thousand little sub-cases, depending on the heterogeneity of the data sources, business rules, existing software and unusual target applications. In the Extraction phase, relevant data is identified and extracted from a data source. Since source data is usually not in a normalized format, it is required to Transform this data, using arithmetic, date conversion or string operations. Finally, in the Loading phase, the already converted data is loaded into a target system model (usually via a staging area database), following a set of policies closely related to the system's solution domain (e.g. corrective data may simply overwrite the last data entry, or all values may be kept for historical reasons). Considering that data may be complex depending on the domain it refers to, a domain expert is usually required for the Extraction and Transformation tasks (Figure 3.2), in order to identify which data is relevant and how it must be transformed so that it can be correctly manipulated. Although mostly computer-science related, domain expertise is also required for the Loading phase (although to a lesser extent compared to the previous two phases), indicating which final data is relevant. Since most traditional ETL approaches rely on the development of specific source code or the usage of technical languages (e.g. SQL) for dealing with data sources, computer-science expertise is usually required throughout the entire ETL pipeline, from the Extraction to the Loading phase.


Figure 3.2: ETL classical pipeline
Such an ETL approach is not the most well-suited, especially in the context of retrieving data from the WWW, due to the huge quantity of text files that follow heterogeneous format / presentation rules. Since the data present in the text files is closely related to the domain it refers to, it is fundamental to involve a domain expert (usually without computer-science skills) in the selection, extraction and preparation / transformation of the relevant data present in the text. Thus, the classical ETL process follows a three-phase iterative procedure:
1. The domain expert identifies the relevant data and a set of procedures to be implemented by a computer-science expert;
2. The computer-science expert codifies this knowledge (e.g. source code) and applies it to the text files;
3. The solution is presented to the domain expert for validation purposes.


This approach has several drawbacks. (i) The time required for the correct processing of one file increases dramatically depending on the domain and the complexity of the data present in the file. Depending on the number of interactions / corrections to the initial file processing solution, the overall time for processing a single file may increase substantially. (ii) Since the logic for the file processing is defined by an individual outside the domain, it is common for wrong assumptions to be made (e.g. data types / validation rules) that may not be detected by the domain expert during the validation phase and are thus propagated to an operational environment, only to be detected much later. (iii) Representing the extraction and transformation knowledge hard-coded (e.g. in source code) makes this knowledge hard to audit mechanically by external domain experts and hard to share with the scientific community. (iv) Since the knowledge is usually not represented in a computable way, it is not easy for external analytical programs to derive metrics regarding the way the knowledge has been codified, for evaluation and improvement purposes.

3.2 Thesis: The ETD+IL Approach
In order to provide a clear separation of concerns [68, 69], this thesis presents a different approach to ETL. A division of the well-known ETL paradigm is proposed, separating domain ETD operations (Extraction, Transformation and Data Delivery), which require domain expertise, from technical IL operations (Integration and Loading), which require computer-science expertise, such that ETL = ETD + IL (Figure 3.3).


Figure 3.3: ETD + IL pipeline
With the inclusion of a domain-related Data Delivery phase, the domain expert can define which processed data shall be delivered to the target system model as the output of the Extraction and Transformation phases. During Data Delivery, the domain expert uses an abstraction that makes the delivery completely transparent with respect to the target application (e.g. system database / file system / Web Server) that will use the processed data, as well as to the internal structure in which the processed data is stored. The new Integration and Loading phases require mostly computer-science expertise. The Integration step can be decomposed into three main tasks (Figure 3.4). First, different data deliveries (possibly from different data sources and processed by different engines) may be gathered together and, in some cases, a synchronization scheme may be required. Once all processed data are available, a unified view is produced. Depending on the data nature and target structure, operations such as removal of duplicates or creation of artificial keys may be performed in these tasks. Finally, the unified view may undergo a format change depending on the specific procedure used for data loading or on the target data store requirements / design. The Loading phase can be considered a mechanical step, since it usually consists in the invocation of a loading program given a pre-formatted, ready-to-use data set (the output of the Integration phase).

Figure 3.4: IL pipeline
By differentiating domain from computer-science operations, the development time required for an ETL solution is reduced and the overall data quality is improved through a close validation of domain data by a domain expert (instead of a computer-science expert). The ETD approach, as proposed, is supported by two core components:
1. A Declarative Language: describes all the ETD statements required for processing an input file;
2. A Graphical Application: makes the declarative language transparent to the domain user. Through interaction with the application, the user specifies which data shall be extracted, transformed and delivered. ETD statements are stored as metadata in File Format Definition (FFD) files.
Both components have been included in a complete data processing solution that is presented in Chapter 4. This solution comprises automatic file retrieval from the Internet, ETD data processing and a set of management applications for controlling and determining the status of the data retrieval + ETD pipeline.
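As a purely illustrative example of the kind of metadata an FFD holds (the actual FFD language is defined in the following chapters; all element and attribute names below are hypothetical and used for exposition only), a minimal FFD instance could look like this:

<ffd name="daily_proton_flux" version="1.0">
  <extraction>
    <section id="data" beginMarker="BEGIN DATA" endMarker="END DATA"/>        <!-- delimits the relevant region of the text file -->
    <field name="timestamp" section="data" column="1" type="Date" format="yyyy-MM-dd HH:mm"/>
    <field name="flux" section="data" column="2" type="Numeric"/>
  </extraction>
  <transformation>
    <operation function="multiply" input="flux" argument="1000" output="flux_scaled"/>   <!-- unit conversion -->
  </transformation>
  <dataDelivery>
    <deliver parameter="PROTON_FLUX" fields="timestamp, flux_scaled"/>        <!-- delivery abstracted from the target application -->
  </dataDelivery>
</ffd>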

3.3 Requirements for ETD
The ETD-based data processing solution proposed in this thesis has been analysed, designed and implemented following a set of thirteen guidelines (general requirements), presented next.

3.3.1 Free, Open Source and Independent
The solution shall be implemented using open-source technologies, presented as a no-acquisition-cost solution accessible to anyone. Furthermore, the solution shall be developed using software independent of the operating system.
As a no-cost data processing solution, any individual or non-profit organization may have free access to the software and test its adequacy and applicability to their specific domain problems. In the author's experience, this application will be especially useful within the scientific community (independently of the domain), which requires free, simple data processing tools for data analysis purposes and is usually limited by a low or non-existent budget. Although the proposed software has been used in practice and in a real operational environment, the ETD-related software is not presented as a closed, final package. On the contrary, in the author's perspective, the software package is presented as a first kick-off solution, open to discussion, refinement and extension within the software community. Thus, the solution shall be implemented using open source technologies, not only in terms of the software specifically developed for this package, but also regarding all its internal software components implemented by external developers. This way, the data processing solution is open to future extensions developed by third parties. Finally, by presenting a solution independent of the operating system, no limitation is posed on the end user regarding any particular operating system, version or release.

3.3.2 Completeness
A complete data processing solution shall be available, comprising data retrieval, data processing and overall management of the data processing solution.
Although the ETD + IL approach, supported by a declarative language and a graphical editor, is the main focus of this thesis, it would have rather limited applicability if only these two components (language and editor) were available. In order to be considered complete, the solution should comprise at least five core components:
o A declarative language for representing ETD assertions;
o A graphical editor for the creation of FFDs based on graphical interaction and notations, making the ETD language transparent to the user;
o A data retrieval service that enables the automatic acquisition of data files based on a scheduling scheme;
o A data processing service that enables the automatic processing of data files according to the previously defined FFDs;
o Management tools for monitoring and controlling the data retrieval and processing pipeline.

3.3.3 Separation of Concerns
The domain user shall be able to use and maintain the data processing pipeline without requiring computer-science expertise. All domain procedures and definitions shall be represented using a high-level declarative language. No specific source code shall be required to implement the processing of a single text file.
Most currently available data processing solutions require computer-science expertise (e.g. programming effort, database or XML schemas, querying languages) to some extent. Such solutions highly restrict the number of tool users to those that have knowledge of, or access to, computer-science expertise. Domain experts that do not fit this profile have to execute their data processing tasks using non-dedicated tools like text editors, performing a set of non-automated, error-prone steps that may alter / corrupt the data contents. In order not to restrict users, the proposed data processing solution shall not require computer-science expertise either at a low level (e.g. programming effort) or at a high level (e.g. interpreting XML schemas and XML documents). In the envisaged solution, all programming tasks shall be replaced by ETD procedures and definitions represented in a high-level XML-based declarative language. This language, although closer to the end user, shall not be manipulated directly either, but made transparent through a graphical application, providing a clear separation of concerns between domain and computer-science expertise.

3.3.4 User Friendliness
A graphical application shall be available, making use of the declarative language in a way that is transparent to the end user.
In order to accomplish a clear separation of concerns between domain and computer-science expertise, it is fundamental that the supporting ETD language and its XML representation be masked from the nominal user. Thus, user friendliness takes on particular importance during the File Format Definition creation, test and debug tasks. A set of gestures, graphical notations (e.g. icons, colours) and wizards shall be used to allow users to interact with the tool and express how the Extraction, Transformation and Data Delivery steps shall be represented given a sample input file. Although of lesser importance, the proposed graphical management tools shall also be user friendly (enabling intuitive monitoring and control of the data retrieval and processing pipeline).

3.3.5 Performance
Data retrieval and data processing shall have a reduced response time while preserving both CPU and network bandwidth resources.
The data processing solution requires good performance both for data processing and for data retrieval, which can be measured by the amount of time required for processing or retrieving a file, respectively. Regarding data processing performance, two boundary, yet common, cases shall be considered: frequent requests for processing small data files and rare requests for processing large data files (e.g. 3 Megabytes in length). For each input file, the time required for its processing shall be linearly dependent on the file's length. Data retrieval requests shall be kept to a minimum in case of retrial attempts. Depending on the type of data available in the input files, the number of retrials in case of failure shall be customized accordingly. As an example, real-time data files that may be overwritten every 5 minutes shall have fewer retrial attempts than summary data files that remain available for retrieval for several weeks / months.
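As an illustration of how such retrial policies could be captured declaratively, the hypothetical retrieval metadata entry below customizes the number of retries per file type; the element and attribute names, provider name and URLs are assumptions made for exposition and do not correspond to the actual schema used by the solution.

<dataServiceProvider name="ExampleProvider" protocol="http">
  <!-- real-time file, overwritten every 5 minutes: few retrial attempts -->
  <file id="realtime_flux" url="http://example.org/rt/flux.txt"
        schedule="PT5M" maxRetries="1" retryInterval="PT1M"/>
  <!-- summary file, available for weeks: more retrial attempts, spaced further apart -->
  <file id="weekly_summary" url="http://example.org/summary/week.txt"
        schedule="P7D" maxRetries="10" retryInterval="PT6H"/>
</dataServiceProvider>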



3.3.6 Scalability
Both data retrieval and data processing must be capable of handling multiple simultaneous downloads and processing requests, respectively.
Scalability is a fundamental requirement for data retrieval and processing, especially for load balancing purposes. When dealing with load balancing issues, data processing takes on special relevance compared to data retrieval. For architectures where the number of provided files to process is very high, or where each provided file requires a long processing time, the data processing tasks must be parallelized and coordinated across different machines / CPUs. Otherwise, a processing bottleneck will occur and all data processing requests will be affected, namely delayed. This situation becomes even more critical when real-time data is being handled, which cannot be delayed or its relevance will be lost / diminished. A similar solution must be applied to the data retrieval process whenever the number of provided files to retrieve is very high, or when each provided file requires a long retrieval time due to its length. In these cases, the data retrieval effort must be parallelized across different network connections, in order not to reach a network bottleneck situation. For situations that involve proprietary data, both the data retrieval and processing components must be customizable and scalable enough that individual subsets of provided files can be retrieved and processed using specific network connections and CPUs, respectively. The data processing solution shall also be able to cope with mixed public and private sets of data, separating both the data retrieval and processing pipelines.

3.3.7 Modularity
The solution architecture and implementation shall be as modular as possible, clearly separating the ETD pipeline from the IL pipeline. Further, there shall be a clear separation between logic and presentation layers, easing future maintenance tasks.
Data retrieval and data processing shall be separated into two independent services that may be installed on different machines. In both cases, the services provided by each engine shall be self-contained, not being limited to the retrieval / processing logic but also enabling control of the service execution as well as retrieval of logging information for monitoring and debug purposes. The graphical presentation layer of both engines shall be decoupled from their logical layer through independent applications. While the logical layer shall execute continuously, the graphical layer may be executed only on user request. Such separation shall be performed in a way that allows a remote machine to be used for monitoring and / or controlling the status of both engines. A generic interface shall be proposed for Data Delivery, completely abstracting the Integration and Loading processes, which are highly coupled with the target database / application that receives all processed data.

3.3.8 Reusability
System modules shall be designed and implemented focusing on reuse as much as possible. Such an approach shall be applied for factoring common behaviour / functionalities within the data processing solution itself or for reusing, entirely or partially, system components in the solution of other problems.
The data processing components shall consider reusability requirements from the early requirements / design phases, in order to be implemented as generically as possible, abstracting domain-specific characteristics. During system design, four main components / libraries have been selected as good candidates for reuse within the data processing solution itself:
o All ETD logic shall be factorized into a single library, common to the application responsible for FFD creation as well as to the data processing engine;
o Both the data retrieval and processing tasks produce a considerable amount of logging data that can be used for debug purposes or for performing data traceability. All low-level logging logic (i.e. reading and writing) shall be developed in a single library, shared by both engines, guaranteeing that all log files follow the same format and rules. A similar approach shall be followed for the graphical presentation of logging data: a graphical component shall be developed that enables the presentation of log events with filtering and querying capabilities;
o Since all data retrieval and processing metadata is codified in XML format, a library for reading, writing and querying XML is required by multiple applications. This library shall constitute an extra layer, closer to the developer, compared to Java's XML facilities, which are too low-level;
o Regular expressions shall be used extensively within the data processing solution, either in the FFD creation step or when applying them in actual data processing.
Further, due to its generic implementation, the entire data processing solution shall enable (with the correct metadata configuration) its reuse as a free COTS package for data processing.

3.3.9 Metadata Driven
The data processing solution shall be metadata driven, which means that all processes for executing and managing the data retrieval and ETD pipeline are based on metadata.
As previously pointed out, the pipeline formed by the data retrieval and ETD engines shall be a generic solution, capable of being customized for handling data from different domains. The customization of the data processing solution shall be performed through the creation / editing of metadata instances for a given pre-defined set of generic concepts that describe the structure, data types and rules that each metadata instance shall follow. XML technology shall be used for storing metadata instances, while concepts shall be represented via XML Schema.


Good examples of metadata (in this context) are the FFDs used for file processing, the applications' configuration files and the scheduling information for the files to be retrieved. Metadata acts, in a very simplistic way, as a set of configuration files that specify how the pipeline shall react and execute. The interaction between the domain user and technical metadata shall be made transparent. Depending on the type of metadata, specific editors shall be used. For example, when dealing with data retrieval metadata, a form-like graphical application shall be presented to the user for inserting the metadata (independently of the internal format in which it will be stored). However, when dealing with the creation of FFDs, a specialized graphical application shall be used. Metadata definitions can also be used for describing default data flow operations at application start (e.g. start / stop of a file retrieval action), for describing the pipeline architectural configuration (e.g. number and location of the data retrieval and data processing components) or merely for storing user preference settings.
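To make the distinction between concepts and metadata instances concrete, the fragment below sketches how a simple concept (a file retrieval entry) could be expressed in XML Schema; it is a simplified illustration with invented names, not the schema actually adopted by the solution.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- concept: the structure, data types and rules that every file retrieval instance must follow -->
  <xs:element name="file">
    <xs:complexType>
      <xs:attribute name="id" type="xs:string" use="required"/>
      <xs:attribute name="url" type="xs:anyURI" use="required"/>
      <xs:attribute name="schedule" type="xs:duration" use="required"/>
      <xs:attribute name="maxRetries" type="xs:nonNegativeInteger" default="3"/>
    </xs:complexType>
  </xs:element>
</xs:schema>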

3.3.10 Correctness
Data typing facilities and validation rules shall be available during the entire ETD process, in order for the outcome of ETD to be valid. These data quality mechanisms shall be applied iteratively in the Extraction, Transformation and Data Delivery steps. The input files' structure is not static and may evolve in time. Commonly, changes are motivated by the addition and / or removal of a parameter (e.g. a column in a table). During the data processing tasks such changes must be detected as soon as possible within the ETD pipeline. The file structure and data correctness shall be initially validated in the Extract phase by the inclusion of data typing mechanisms for all extracted data (e.g. String, Numeric, Date) and validation rules. Such validation rules may be relative to the file structure itself (e.g. a section cannot be empty, a section must contain between 5 and 15 lines) or to data boundary conditions (e.g. string parameter values must be at least 5 characters long, integer parameter values must be less than 1000). During the Transformation phase, data typing shall also be applied to all the input values of each transformation function before its execution. If an error is detected, the file processing shall be stopped and the system administrator shall be notified by e-mail. Correctness is a very important requirement since it prevents corrupted / invalid data from being propagated to the Data Delivery phase. Without any validation / data typing scheme, a subtle change in the file format could result, for example, in string values being delivered where date values were expected.
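The following is a minimal sketch of the kind of boundary-condition rule described above; the RangeRule class is a hypothetical helper for illustration only and not part of the actual DPM implementation.

// Hypothetical boundary-validation rule: an integer field must lie inside [min, max].
public final class RangeRule {
    private final int min;
    private final int max;

    public RangeRule(int min, int max) {
        this.min = min;
        this.max = max;
    }

    /** Returns true only if the raw value parses as an integer inside [min, max]. */
    public boolean isValid(String rawValue) {
        try {
            int value = Integer.parseInt(rawValue.trim());
            return value >= min && value <= max;
        } catch (NumberFormatException e) {
            return false; // wrong data type: treated as a validation failure
        }
    }

    public static void main(String[] args) {
        RangeRule rule = new RangeRule(0, 999); // e.g. integer values must be less than 1000
        System.out.println(rule.isValid("42"));   // true
        System.out.println(rule.isValid("1500")); // false -> stop processing, notify the administrator
        System.out.println(rule.isValid("abc"));  // false -> data typing violation
    }
}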

3.3.11 Validation
After defining the ETD specification based on a primary input file, the FFD's generality shall be tested with a larger set of text files belonging to the same class of files.


Each FFD shall be created with a specialized editor (based on a primary sample input file) where the domain user annotates all relevant data to extract, the transformation pipeline and how processed data shall be delivered. However, since only one input file is used in the process, the ETD assertions defined for the file may be overfitted (not general enough) for processing all input files of that class. In order to produce FFD metadata that is generic enough, more than one text file shall be used in the FFD validation. If the FFD is found not to be general enough, then the domain user can perform modifications before making it available for online data processing.

3.3.12 Data Traceability
It shall be possible to trace back a processed datum value to the originally downloaded file. Depending on the processing logic present in the FFD metadata, slight modifications to the input file structure may cause incorrect values to be propagated into Data Delivery (if no validation / data typing restrictions are imposed). Due to the possibly high number of concurrent data processing requests, from multiple input files, it may not be obvious which input file and FFD version caused an incorrect data delivery. Thus, data traceability methods are required to identify both the FFD (whose logic shall be corrected) and the input file that raised the incorrect data delivery (for subsequent testing). Data traceability shall be performed via logging information made available by the data retrieval and processing engines, as well as by the Data Delivery interface. Since input files are mostly overwritten or become unavailable at the source Data Service Provider site (after some time), in order to accomplish complete data traceability, a cache directory shall keep all the retrieved input files, as downloaded from the Data Service Provider site.

3.3.13 Fault Tolerance
In case of failure during download, the retrieval of the failed file shall be retried. If an error occurs during data processing, the administrator must be notified and the remaining data processing operations shall be resumed. Since many scientific Data Service Providers are maintained with limited / nonexistent budget resources, data availability cannot be guaranteed by the provider, nor is any notification scheme implemented to warn Data Service Provider clients about site unavailability periods or modifications to the input file structure. Regarding data availability issues, the data retrieval solution shall enable retrial and recovery of input data, based on the nature of the data present in the files. In the case of real-time data, the number of attempts shall be small (e.g. spanning up to 5 minutes), otherwise real-time relevance can be lost. However, if summary data is available, data values are considered constant and are usually available for long periods (e.g. months), so multiple retries can be attempted during this period.

The occurrence of data processing failures is usually related to a change in the file's structure. On failure, the data processing activity shall stop and the administrator shall be notified about the error. Independently of the type of failure, the application shall log the failure event, disregard the related task and resume the other data retrieval / processing tasks. The occurrence of a failure shall be transparent to any other data retrieval and data processing requests.


Chapter 4
Data Processing Module
This chapter presents a complete Data Processing solution based on the proposed ETD+IL approach. First, the main technologies involved in the construction of the data processing solution are introduced, as well as how they have been weaved together. A discussion regarding the solution's architectural design follows. Then, each individual component of the Data Processing solution is described. Depending on the component's complexity, its internal data flows and functionalities are explained, as well as the core services made available to external applications (if any).


This chapter presents a data processing solution based on the requirements established for the ETD+IL approach, discussed in Chapter 3. The solution is described as complete since it is not limited to the core ETD components (i.e. the FFD Editor application and its supporting FFD language), but comprises a set of applications that enable:
o Automatic data retrieval from data service providers;
o Creation and customization of metadata related with data retrieval;
o Controlling the data retrieval / ETD pipeline;
o Monitoring the data retrieval / ETD pipeline.

This solution is based on open-source and platform-independent technologies, which are presented next, as well as how they have been weaved together. Then, the internal architecture for the solution is described, with all its internal server services and client tools. Special attention will be placed on the scalability and performance of the data processing solution. An individual presentation of all system components follows, starting with the server services and followed by the client tools 9. For some applications, the explanation of their functioning shall be supported by the description of some internal metadata concepts. Although referenced, the ETD core components will not be presented at length in this chapter, since Chapter 5 is dedicated to them.
9 Only the main functionalities / interactions regarding the client tools are presented in the scope of this report.

4.1 Technologies
This section presents the main technologies used in the development of the data processing solution. All technologies have been evaluated according to their current usage in information systems, their technological maturity level and the requirement for open-source and platform-independent software.

4.1.1 XML
XML [70] is a meta markup language that provides an open standard for describing documents containing structured information. XML allows tags to be defined by the data publisher or application developer. By providing a common method for identifying data, XML is expected to become the dominant format for data transformation, exchange and integration. XML's core syntax became a World Wide Web Consortium (W3C) recommendation in 1998 and since then XML has rapidly been adopted. Using XML, information publishers can define new tags and attribute names at will. Rather than being constrained to defining a particular set of data, XML is able to work with DTDs and XML Schema to define any number of documents that form a language of their own. Thousands of specific document vocabularies have already been created using XML to meet the specific needs of various industries, including finance, legal publishing and health care. XML documents do not contain processing specifications or limitations and can be freely exchanged across multiple platforms, databases and applications, as long as the subscriber data stores and applications are XML-aware. Any system, mobile device or application that speaks XML can access and manipulate data present in an XML document at will.

4.1.2 XML Schema
In order to specify and validate an XML document, a schema language is required. A schema file defines a set of rules to which an XML document must conform in order to be considered valid (according to that schema). Since XML is essentially a subset of the Standard Generalized Markup Language (SGML), the first approach (as a schema language) was to use DTDs. However, the DTD language is quite limited, especially in terms of data types. As a solution, the W3C proposed XML Schema, a much more powerful, XML-based language that enables constraints on the structure and content of elements and attributes, as well as the creation of data types, allowing a much richer description of the intended XML documents.
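As an illustration, the sketch below validates an XML instance against an XML Schema using Java's standard JAXP validation API; the file names are assumptions for the example.

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

// Validates instance.xml against ffd.xsd; validate() throws an exception on violations.
public class SchemaCheck {
    public static void main(String[] args) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("ffd.xsd"));
        Validator validator = schema.newValidator();
        validator.validate(new StreamSource(new File("instance.xml")));
        System.out.println("instance.xml is valid according to ffd.xsd");
    }
}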

4.1.3 XPath
XPath (XML Path Language) is a syntax for accessing fragments of an XML document, somewhat like a simple query language. This syntax allows retrieving and transforming branches of an XML document tree through several axes, node tests, predicates, functions and operators. XPath is not a technology per se, meaning that it is not used on its own but within several XML-related technologies such as XSL and XQuery. Moreover, XPath is supported by most XML parsers and XML-related engines.
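A minimal Java example of evaluating an XPath expression with the standard javax.xml.xpath API is sketched below; the document and element names are illustrative only.

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Selects the text of the first Name element under the document root of instance.xml.
public class XPathExample {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("instance.xml");
        XPath xpath = XPathFactory.newInstance().newXPath();
        String name = xpath.evaluate("/*/Name/text()", doc);
        System.out.println("Name = " + name);
    }
}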

4.1.4 XSLT
XSLT (eXtensible Stylesheet Language Transformations) is a language for transforming XML documents. Given an input XML file and an XSLT style sheet, an XSLT processor generates an output (XML or non-XML) file. Since XSLT is a language, an XSLT style sheet consists of a set of rules which specify how the input tree elements are mapped into an output file. The mapping can be a copy, a transformation or simply ignoring the content. By generating an XML file, it is possible to port documents from one application to another with different input / output formats. By generating non-XML files, it is possible (although not limited to this) to create reports or other end-user documents such as (plain) text, Hyper Text Mark-up Language (HTML), PostScript (PS), Rich Text Format (RTF) or even Portable Document Format (PDF) documents. XSLT processing can be performed server-side using a library, client-side using a modern web browser or during development, using a specialized XML editor.


4.1.5 XQuery
XQuery (XML Query Language) is a query language for extracting and manipulating collections of data from an XML document or database. Other than querying, it also comprises some functional programming features. It is closely related to XPath and XSLT, since it uses XPath expressions, and, just like in XSLT, the resulting output can be an XML document or non-XML content.

4.1.6 Apache Tomcat HTTP Server
The Apache Software Foundation is a non-profit organization, well known for the Apache HTTP Server (usually known simply as Apache), the most widely used HTTP server worldwide. This foundation supports a large number of open-source projects, including Apache Tomcat, a web container that implements the Web component contract of the J2EE (Java 2 Enterprise Edition) architecture, namely the servlet and JSP (Java Server Pages) specifications. Following these specifications, an environment is created where Java code can execute within the web server and interact with remote applications. Apache Tomcat is a standalone server supported by a single JVM (Java Virtual Machine).

4.1.7 SOAP
SOAP (Simple Object Access Protocol) is a protocol for exchanging XML-based messages, usually over HTTP. While the use of HTTP is not mandatory, it is the most common choice due to its friendliness to firewalls. As long as SOAP uses HTTP, security problems should be kept to a minimum, since the HTTP port (80 or 8080) is usually open. SOAP can be used simply to exchange messages or to implement client / server applications. Since it is based on HTTP and XML, SOAP is platform independent and simple to implement, relying on the various HTTP and XML tools, libraries and applications. Together with HTTP, SOAP is used as the basis of the Web Service stack.

4.1.8 Web Services
The W3C defines a web service as a software application designed to support interoperable machine-to-machine interaction over a network. Web services are frequently just Application Programmer Interfaces (APIs) that can be accessed over a network (such as the Internet) and executed on a remote system hosting the requested services. Web services following the W3C definition use SOAP-formatted XML envelopes and have their interfaces described by the Web Service Definition Language (WSDL).

4.1.9 Java
Java is an object-oriented programming language developed by Sun Microsystems. Java applications are platform-independent since they are compiled to bytecode, which is compiled to native machine code at runtime. This programming language is widely used and support is highly available within the computer-science community. Graphical applications can also be developed using Java and can be customized to present the Java standard look and feel or to use the graphical settings of the platform where the program executes.

4.1.10 Regular Expressions
The origins of regular expressions 10 [71-73] lie in automata theory and formal language theory, both of which are part of theoretical computer science. These fields study models of computation (automata) and ways to describe and classify formal languages. A regular expression is a textual statement, applied to a text, usually for pattern matching purposes. In a very simplistic way, a regular expression can be seen as a wildcard that expresses a matching criterion (e.g. *.txt to find all text files in a file manager). A match is a piece of text (a sequence of bytes or characters) which obeys the rules specified by the matching expression and by the regular expression processing software. "\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b" is an example of a complex regular expression pattern. It describes a series of letters, digits, dots, underscores, percentage signs and hyphens, followed by an @ sign, followed by another series of letters, digits and hyphens, finally followed by a single dot and between two and four letters; in other words, the pattern describes an e-mail address. With this regular expression pattern, the user can search through a text file to find e-mail addresses, or verify whether a given string (the text applied to the regular expression) resembles an e-mail address. A regular expression engine is a piece of software that can process regular expressions, trying to match the pattern against a given string. Different regular expression engines are not fully compatible with each other. The most popular regular expression syntax was introduced by Perl 5. Recent regular expression engines are very similar, but not identical, to the one of Perl 5. Examples are the open-source Perl Compatible Regular Expressions (PCRE) engine (used in many tools and languages like PHP), the .NET regular expression library, and the regular expression package included with version 1.5 and later of the Java Development Kit (JDK).
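As an illustration of the e-mail pattern above, the following sketch uses Java's java.util.regex package; the input string is invented for the example.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Finds e-mail addresses in a string using the pattern discussed in the text.
public class EmailMatch {
    public static void main(String[] args) {
        Pattern email = Pattern.compile(
                "\\b[A-Z0-9._%-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher m = email.matcher("Contact: john.doe@example.org or visit our site.");
        while (m.find()) {
            System.out.println("Match: " + m.group()); // prints john.doe@example.org
        }
    }
}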

4.1.11 Applying the Technologies
This section presents a Knowledge Model that correlates all the technologies previously presented (Figure 4.1). As can easily be seen, most of the technologies are XML-related (either directly or indirectly). XML Schema is used for validating XML documents, both in terms of structure and data typing. XPath is used for querying XML documents, accessing information in specific well-known nodes of the XML tree. On the other hand, XSLT is used for transforming XML documents, usually for rendering the outputs of a query (through the XQuery language) to a specific format such as HTML.

10 Usually abbreviated to Regex or Regexp.


Figure 4.1: Technology knowledge model
A Metadata abstraction has been defined, whose instances are codified in XML and whose concepts are defined with XML Schema [74]. Within the Data Processing module, XML is widely used, either by Web Services and External Applications (as a Metadata abstraction) or at the communication level, whenever SOAP messages are exchanged. Both the Tomcat HTTP Server and the Web Services are supported by the Java language, which is also used in the implementation of all External Applications (mainly graphical applications). Further, Java also provides the regular expression libraries used in the textual processing of semi-structured text files. The Tomcat HTTP Server hosts all the Web Services that are made available in the scope of the Data Processing module. SOAP is used as the communication middleware between the Web Services and the other External Applications.

4.2 Data Processing Module Architecture
In this section, a general architecture for a Data Processing Module that strongly supports the declarative ETD assertions present in the FFD language is proposed. The architecture depicted in Figure 4.2 provides a concrete realization of the ETD+IL approach.


[Figure: architecture diagram — Data Service Providers feed the FR Engine, which sends (PF, FFD) pairs to the ETD Engine; processed data flows to the Data Delivery Interface and then to Integration and Loading; the FFD Editor, DPM Console, Metadata Repository and Log Analyser support the pipeline.]
Figure 4.2: Data Processing Module architecture
The Data Processing Module presented in Figure 4.2 follows a three-color scheme. Green components, i.e. Data Service Providers and Integration and Loading, are external to the Data Processing Module and present themselves as interfaces for data acquisition and data delivery, respectively. The dark blue component, i.e. the Metadata Repository [75], represents a non-mandatory component that has been developed outside the scope of the Data Processing Module, but whose integration with the DPM solution is advisable for managing all DPM metadata 11 [75]. The remaining light blue components and data flows have been specifically developed as part of the Data Processing Module solution:
1. File Retriever (FR) engine: Responsible for the acquisition of input data files from external data service providers;
2. Extractor, Transformer and Data Delivery (ETD) engine: Responsible for applying a File Format Definition to an input file, thus producing a resulting set of data deliveries;
3. Data Delivery Interface: A generic web service responsible for receiving the processed ETD data and applying / forwarding it to the Integration logic. The Integration layer is responsible for the construction of a unified view combining multiple data deliveries together (possibly from different data sources), which is afterwards Loaded into an application or data repository;
4. File Format Definition Editor: A graphical tool that enables the creation, debugging and testing of ETD statements based on the FFD language. All user-machine interaction is performed via graphical gestures / wizards / annotations over a sample file;

11 The Metadata Repository software component constitutes the Master Thesis of Ricardo Ferreira and no specific details will be presented in this report. Although not mandatory for using the Data Processing Module, the usage of the Metadata Repository is considered advisable. The integration of both technologies has been tested successfully in the scope of the SESS system (Chapter 7).

5. DPM Console: A graphical tool that enables controlling and monitoring both download and data processing actions. The application enables starting / stopping file downloads and metadata creation / editing, as well as consulting and filtering all generated logging information;
6. Log Analyser: A graphical tool for analysis and querying of offline logging information.
The File Retriever and ETD engines (implemented as web services) execute continuously and are responsible for the download / ETD chain. Both engines can be deployed on the same or on different machines according to data policies and / or performance needs. The FR Engine is responsible for downloading all text files from the Data Service Providers and for sending each downloaded file for processing to an ETD Engine, together with the appropriate FFD (containing all the ETD actions to be applied). Processed data is then sent to the Data Delivery Interface, responsible for the implementation of all Integration and Loading actions. All metadata required by both the FR and ETD engines is stored in a specialized Metadata Repository. In order to control and visualize the actions of both the FR and ETD engines, the graphical DPM Console application is available to the system administrator. Using a sample input file as an example, the domain expert can create a new FFD using the graphical FFD Editor application. Upon creation, the FFD is uploaded to the Metadata Repository and becomes available to both engines. Finally, the Log Analyser application enables visualization and querying of previous log files, created by the FR or ETD engines.

4.2.1 Scalability
The proposed DPM architecture is highly scalable and can easily be configured by the system administrator (using the DPM Console tool) through metadata customization. FR and ETD engine instances can be deployed on the same or on different machines; in the architecture configurations depicted in Figure 4.3, each FR or ETD component represents a distinct instance of that service. Depending on the system administrator, a single DPM Console may be used to control all FR and ETD services, or multiple consoles may be used to control subsets of these services.

Figure 4.3: FR versus ETD architecture


The architecture presented in Figure 4.3 a) is the simplest one, where only one FR and one ETD engine are defined. This architecture is suggested when both the data to download and the data to process are of reduced / moderate volume. In Figure 4.3 b) a one-FR-to-many-ETD architecture is presented, where the number of ETD engines is customizable by the system administrator. This configuration is recommended when the volume of retrieved data is reduced / moderate (a single FR is capable of performing the task) but the processing load is high, being distributed over several ETD engines executing on different machines (load balancing schemes are addressed in the next section). Finally, the architecture in Figure 4.3 c) presents a solution for when the volume of data to retrieve and to process is quite high. In this situation multiple FR and ETD engines divide the retrieval 12 and data processing loads, respectively. Figure 4.4 presents two possible architectural configurations for the ETD and Data Delivery components.

Figure 4.4: ETD versus Data Delivery architecture
The simplest setup is presented in Figure 4.4 a), where a single ETD engine is responsible for processing all data, which is then forwarded to a single Data Delivery Interface. In Figure 4.4 b) a multi ETD / Data Delivery architecture is depicted, containing a many-ETD-to-one-Data-Delivery association: when a large volume of data must be processed, the processing is parallelized via multiple ETD engines, but all processed outcomes are delivered to a single output point. Independently of this pipeline, other parallel pipelines may exist (i.e. ETD Z and Data Delivery Z). The decision to maintain independent data processing pipelines may result from different reasons, ranging from proprietary / copyright issues that may require that the data processing be kept in isolation, to scalability factors (e.g. an isolated pipeline for processing large volumes of offline data, delivered to a dedicated Data Delivery Interface performing IL operations differently from the real-time pipeline).
12 In the FR context, data retrieval load shall be understood as the network bandwidth required for performing all I/O-bound actions (not related to CPU actions).

4.2.2 Load Balancing
Within the Data Processing architecture, load balancing is important especially when the amount of data to process is high and cannot be handled by a single ETD component. Figure 4.5 presents three architecture schemes for balancing the data processing load across different ETD engines. In Figure 4.5 a) an example of round-robin load balancing is depicted. After specifying a pool of ETD components that act as data processors, each new input file (also known as Provided File - PF) is distributed for processing over the pool of ETD components (from the first to the last) in a cyclic way. Another approach to load balancing is presented in Figure 4.5 b). In this approach specific load balancing is used, where there is an association between each input file and a dedicated ETD component for data processing. Finally, Figure 4.5 c) results from a hybrid architecture of both the Figure 4.5 a) and b) approaches, where FR A performs round-robin load balancing and FR B performs specific load balancing.

Figure 4.5: Load balancing architecture
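A minimal sketch of the round-robin scheme is given below; the endpoint URLs and class names are illustrative only and do not correspond to the actual FR Engine code.

import java.util.List;

// Each new provided file is handed to the next ETD endpoint in the pool,
// cycling back to the first one after the last.
public class RoundRobinRouter {
    private final List<String> etdEndpoints; // e.g. URLs of ETD web services
    private int next = 0;

    public RoundRobinRouter(List<String> etdEndpoints) {
        this.etdEndpoints = etdEndpoints;
    }

    public synchronized String pickEndpointFor(String providedFile) {
        String endpoint = etdEndpoints.get(next);
        next = (next + 1) % etdEndpoints.size();
        return endpoint;
    }

    public static void main(String[] args) {
        RoundRobinRouter router = new RoundRobinRouter(
                List.of("http://etd-a/ws", "http://etd-b/ws", "http://etd-c/ws"));
        for (String pf : List.of("file1.txt", "file2.txt", "file3.txt", "file4.txt")) {
            System.out.println(pf + " -> " + router.pickEndpointFor(pf));
        }
    }
}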


4.3 File Retriever Engine
The File Retriever engine enables the automatic retrieval of data files based on time schedulers and has been implemented as a Web Service. The engine contains all the logic required for remote file acquisition and enables remote control through a set of web methods defined as a public API. House-keeping log information is also made available to external applications through a simple subscription service, where external applications, after registration, periodically receive all logging events that take place at the engine (in XML format). The application is highly metadata oriented, especially regarding the Data Service Provider and Provided File metadata concepts (described in detail in the following sub-section). At the Data Service Provider level, the user can specify the connector type for reaching the Data Service Provider: HTTP, FTP, Web Service, binary file or database (through JDBC). Although each connector holds specific metadata (regarding each particular connection), when executed by the File Retriever Engine these details are abstracted, since a common connection interface is implemented by each connector type. Data requests, storage of the downloaded contents into the cache directory, occurrence of retrieval errors and possible retrial operations are some examples of operations performed by the File Retriever where the specific data connector is abstracted. At the Provided File level, a reference to the FFD that will be used for processing the file is available, as well as the source file path, SQL query or web service arguments, depending on the related Data Service Provider connection type. For each Provided File it is also possible to specify the path where the target file is to be stored (in the local cache) and which type of schedule shall be used for retrieving the file. Finally, each Provided File may have a set of relations to a pool of dedicated ETD engines for processing.
Figure 4.6 presents the tasks performed by the FR Engine when it is initiated. At application start, the first action to be executed is a query to the Metadata Repository, requesting all the Data Service Provider and related Provided File metadata and storing them as XML files in a local folder. If a connection to the Metadata Repository is not available at web service start-up, but previously saved metadata is available, then the application is started according to this metadata. Otherwise, if no metadata is locally available nor a connection to the Metadata Repository can be established, then the web service is not initiated. Then, each declared Data Service Provider is analysed according to the Active Boolean flag element. If inactive, all Data Service Provider and related Provided File information is disregarded. Otherwise a DSP dispatcher thread, which will manage all downloads for that specific Data Service Provider, is initiated. After launching the dispatcher thread, all Provided Files are processed and, for those set to Active, the retrieval time for the next file request is calculated (depending on the type of scheduling that has been defined for the provided file). The schedule request is then added to a time-based queue belonging to the scheduler thread. When all Data Service Providers and Provided Files have been iterated, the Scheduler thread is initiated, analysing the requests stored in its queue.


[Figure: activity diagram — load DSP and PF metadata (abort if unavailable); for each active DSP start a DSP Dispatcher; for each active PF add its schedule to the Scheduler; finally, start the Scheduler.]
Figure 4.6: FR Engine actions after being launched
Figure 4.7 presents the state chart diagram for the Scheduler thread tasks. This thread executes in a closed loop, which is only terminated when the FR Engine application is stopped. When started, the Scheduler sorts all requests present in its queue according to their next download date / time. Every Δ milliseconds (e.g. 500 milliseconds) the thread resumes from its wait state, retrieves the next download request from its queue and determines whether it is time for the data retrieval to occur. If not, the thread is placed in the wait state for another Δ milliseconds. Otherwise, the schedule is removed from the queue and sent to the dedicated Data Service Provider dispatcher, which retrieves data for that Data Service Provider. Then the Provided File's schedule is updated with the date / time of its next request and put back into the scheduler's queue.
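The scheduler loop just described can be sketched in Java as follows; the class and field names are hypothetical and the re-scheduling interval is a placeholder, so this is an illustration rather than the actual FR Engine code.

import java.util.PriorityQueue;

// Requests are kept in a priority queue ordered by their next download time; every
// delta milliseconds the head request is checked and, if due, handed to its DSP
// dispatcher and re-queued with its next scheduled time.
public class SchedulerSketch implements Runnable {

    static class ScheduleRequest implements Comparable<ScheduleRequest> {
        final String providedFile;
        long nextDownloadMillis;

        ScheduleRequest(String providedFile, long nextDownloadMillis) {
            this.providedFile = providedFile;
            this.nextDownloadMillis = nextDownloadMillis;
        }

        public int compareTo(ScheduleRequest other) {
            return Long.compare(nextDownloadMillis, other.nextDownloadMillis);
        }
    }

    private final PriorityQueue<ScheduleRequest> queue = new PriorityQueue<>();
    private final long deltaMillis = 500;
    private volatile boolean running = true;

    public synchronized void add(ScheduleRequest request) {
        queue.add(request);
    }

    public void run() {
        while (running) {
            synchronized (this) {
                ScheduleRequest head = queue.peek();
                if (head != null && head.nextDownloadMillis <= System.currentTimeMillis()) {
                    queue.poll();                      // remove from the scheduler queue
                    dispatchToDspThread(head);         // hand over to the DSP dispatcher thread
                    head.nextDownloadMillis += 60_000; // prepare the next request (example: one minute later)
                    queue.add(head);                   // put it back into the queue
                }
            }
            try {
                Thread.sleep(deltaMillis);             // wait state for another delta milliseconds
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                running = false;
            }
        }
    }

    private void dispatchToDspThread(ScheduleRequest request) {
        System.out.println("Requesting download of " + request.providedFile);
    }
}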

[Figure: activity diagram — sort the scheduler queue; get the next download request; if it is not yet due, wait 500 milliseconds; otherwise remove it from the queue, request the download from the DSP Dispatcher thread, prepare the PF schedule for the next request and add it back to the scheduler queue.]
Figure 4.7: Scheduler actions
Figure 4.8 presents the state chart for the Data Service Provider dispatcher thread tasks. This thread also executes in a closed loop, which is only terminated when the FR Engine application is stopped or the Data Service Provider is set to Inactive. When started, the dispatcher sorts all the schedule requests residing in its queue. Every Δ2 milliseconds the thread resumes from its wait state, retrieves the next download request from its queue and determines whether it is time for the data retrieval to occur. If not, the thread is placed in the wait state for another Δ2 milliseconds. Otherwise, the dispatcher queries its internal pool of connectors (i.e. open connections to an HTTP server, FTP server, Web Service, binary file or JDBC database), determining whether at least one connector is available 13 for data retrieval. If no connector is available, then the application waits Δ2 milliseconds before re-querying the connectors' availability. This process is repeated cyclically until a connector becomes available. When one is available, the request is removed from the queue and the file is retrieved via a specialized connection (as described in the Data Service Provider metadata). If the retrieval is successful, then the downloaded file is placed in the cache directory and a log entry is created. Otherwise, the request may be retried until a maximum number of retries has been reached. This number depends on whether the provided file refers to real-time data (up to 4 retries within 5 minutes) or summary data (up to 16 retries within a week), following a geometric distribution. If the maximum number of retries has been reached, then the request is dropped and a new log entry is created. Otherwise, a new request is prepared with an added time delay and placed back in the priority queue, which is then re-sorted.
13 Due to performance / bandwidth usage issues, the administrator may limit the number of simultaneous connections.
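The retry behaviour can be illustrated with the sketch below; the initial delays are assumptions chosen so that four doubled retries span roughly 5 minutes and sixteen span roughly a week, and the class is illustrative rather than the actual FR Engine code.

// Delays grow geometrically: each retry waits twice as long as the previous one.
public class RetryPolicy {
    private final int maxRetries;
    private final long initialDelayMillis;

    public RetryPolicy(boolean summaryData) {
        this.maxRetries = summaryData ? 16 : 4;
        // Assumed starting delays: 20 s for real-time data, 10 s for summary data.
        this.initialDelayMillis = summaryData ? 10_000L : 20_000L;
    }

    /** Delay before retry number 'attempt' (1-based), or -1 if retries are exhausted. */
    public long delayBeforeRetry(int attempt) {
        if (attempt > maxRetries) {
            return -1; // drop the request and create a log entry
        }
        return initialDelayMillis * (1L << (attempt - 1)); // 1x, 2x, 4x, 8x, ...
    }

    public static void main(String[] args) {
        RetryPolicy realTime = new RetryPolicy(false);
        for (int attempt = 1; attempt <= 5; attempt++) {
            System.out.println("real-time retry " + attempt + " after "
                    + realTime.delayBeforeRetry(attempt) + " ms");
        }
    }
}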

[Figure: activity diagram — sort the requests queue; get the head request (wait if the queue is empty); wait until a connection is available; remove the request from the queue and retrieve the data; on success move the file to the cache and notify the user; on failure, prepare a retry request unless the maximum number of retries has been reached, in which case report the error.]
Figure 4.8: Data Service Provider Dispatcher actions


4.3.1 Main Metadata Concepts
The FR engine is mainly supported by two metadata concepts: Data Service Provider and Provided File. Together, these two concepts hold all the information regarding data retrieval, e.g. download connection specificities such as protocol, username or password, scheduling information and the reference(s) to the ETD components and FFD that will be used for data processing. Metadata instances for these two concepts are created and edited through specialized graphical editors available in the DPM Console application, while the FR Engine uses the instance contents for actually performing the data retrieval.

4.3.1.1 Data Service Provider
The Data Service Provider concept contains all the specific information for connecting to an external Data Service Provider site and the references to the Provided File instances available for that site. In all metadata instances there are two mandatory element groups required by the Metadata Repository that manages DPM metadata: identificationElementsGroup and documentationElementsGroup. The identificationElementsGroup contains the elements that uniquely identify an instance and provide a description for it, i.e. Name, ShortName and Description. The documentationElementsGroup contains information regarding the creation of the metadata instance, i.e. Author, CreationDate, ModificationDate, Status and Comments. Five other elements are defined under the DataServiceProvider node (Figure 4.9).

Figure 4.9: Data Service Provider schema
The Active element acts as a Boolean flag indicating whether the metadata instance is active or has been removed. In order to provide metadata traceability facilities, Data Service Provider instances are marked as not Active and preserved in the Metadata Repository, instead of being removed. The Started element is another Boolean flag, indicating whether the data files from the Data Service Provider site shall be retrieved (true) or not (false) when the File Retriever service is started. This parameter is defined by the user through the execution of start / stop commands in the DPM Console graphical application (Section 4.7). The Acknowledgment element holds a textual description providing acknowledgment to the Data Service Provider site that provides the data.
A connection to a Data Service Provider can be established in one of three ways (Figure 4.10):
o Binary program: Defined by the ProgramPath to the binary program to execute and a possible set of Arguments sent via the command prompt;
o Web application: With Type set to HTTP, FTP or Web Server, a URL and Port number on the Internet and, possibly, Login and Password fields;
o Database: Defined by a DatabaseName that is available on a HostName machine, possibly requiring Login and Password validation.
Finally, the ProvidedFiles element holds a set of references to the Provided File instances that are to be downloaded from the provider site.

Figure 4.10: Connection element


4.3.1.2 Provided File
The Provided File concept (Figure 4.11) contains all the specific information for referencing one input file within a Data Service Provider site, the directory and naming conventions for storing the downloaded file, and the ETD engine and FFD references to be used in the data processing phase. Besides the identificationElementsGroup and documentationElementsGroup, eight other elements are defined under the ProvidedFile node. The Active element acts as a Boolean flag indicating whether the metadata instance is active or has been removed. In order to provide metadata traceability facilities, Provided File instances are marked as not Active and preserved in the Metadata Repository, instead of being removed. The Started element is another Boolean flag that indicates whether the input file shall be downloaded (true) or not (false) when the File Retriever service is started. The isSummary element also acts as a Boolean flag and indicates whether the input file refers to real-time or summary data. According to this metadata, the data file may be processed with a different priority, and different recovery actions may be attempted in case of failure during the download.

Figure 4.11: Provided File schema


Associated with each Provided File, a reference to an FFD instance may exist (FileFormatDefinitionRelation) if the input file is intended to be processed after download. Figure 4.12 presents the child nodes of the Provided File Source element, which are closely related to the Connection type defined at the Data Service Provider (Figure 4.10). If the Data Service Provider refers to an ExternalProgram, HTTP or FTP connection, then the File element - Figure 4.12 a) - shall be defined, as well as the Directory and FileName elements, forming the path where the program to execute resides (if the Data Service Provider refers to an ExternalProgram) or the reference to the file to download (if the Data Service Provider refers to an HTTP or FTP connection). Otherwise, if the connection refers to a Web Service, then the File element must be defined, as well as the Directory and WebMethod elements.

Figure 4.12: Source / File element (up) and Source / Database element (down)
The WebMethod element definition (Figure 4.13) is performed by instantiating the Name and an optional set of arguments (each defined by an argument Name, DataType and the Value to be used when calling the web method).

Figure 4.13: WebMethod element

Finally, if the Data Service Provider refers to a Database connection - Figure 4.12 b) - then the Database element shall have a Query element containing an SQL query to be executed in the database and an optional set of arguments (similar to the Web Service connection). Next, a set of four data query examples is presented, one for each of the four possible connection types: FTP query (Figure 4.14), binary query (Figure 4.15), database query (Figure 4.16) and Web Service query (Figure 4.17). Each example comprises two metadata subsets, from the Data Service Provider (Connection element) and Provided File (Source element) instances, that together define the connection. The TargetFile element contains the Directory and FileName where the downloaded file shall be placed and renamed, respectively.

[Figure content (element values only): ftp, ftp.sec.noaa.gov, anonymous, [email protected], 21, /pub/forecasts/45DF/, MMDD45DF.txt]
Figure 4.14: Example of an FTP Query
[Figure content (element values only): C:/Generator/start.exe, C:/Generator/outputs/, output1.txt]
Figure 4.15: Example of a Binary Program Query
[Figure content (element values only): Clients, MainDatabaseMachine, Administrator, apassword, select name from clients_table]
Figure 4.16: Example of a Database Query
[Figure content (element values only): web service, http://localhost, 8080, SCWebService, orbitPropagatorService, Satellite Name, String, XMM]

Figure 4.17: Example of a Web Service Query
A set of naming conventions has been defined for the file name elements of the Source and TargetFile file-acquisition XML elements (a sketch of how such tokens can be expanded follows the list):
o YY: The last two digits of the current year;
o YYYY: Four digits for the current year;
o MM: Two digits for the current month;
o dd: Two digits for the current day;
o HH: Two digits for the current hour;
o mm: Two digits for the current minute.
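The sketch below shows one way such tokens could be expanded into a concrete file name at download time; the helper class is hypothetical and not part of the FR Engine, and the longer tokens must be replaced before the shorter ones (YYYY before YY).

import java.time.LocalDateTime;

// Expands the naming-convention tokens into digits taken from the current date / time.
public class FileNameTokens {
    public static String expand(String pattern, LocalDateTime now) {
        return pattern
                .replace("YYYY", String.format("%04d", now.getYear()))
                .replace("YY", String.format("%02d", now.getYear() % 100))
                .replace("MM", String.format("%02d", now.getMonthValue()))
                .replace("dd", String.format("%02d", now.getDayOfMonth()))
                .replace("HH", String.format("%02d", now.getHour()))
                .replace("mm", String.format("%02d", now.getMinute()));
    }

    public static void main(String[] args) {
        // For example, the pattern "MMdd45DF.txt" on 7 June 2007 becomes "060745DF.txt".
        System.out.println(expand("MMdd45DF.txt", LocalDateTime.of(2007, 6, 7, 12, 0)));
    }
}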

Figure 4.18 presents the available scheduling schemes for file retrieval purposes. If the ScheduleOptions element is not defined, then no automatic scheduling applies to the file and its data retrieval will only occur on user request.

Figure 4.18: ScheduleOptions element


If defined, three scheduling schemes are available:
o Retrieve whenever a time condition is reached: (i) Daily: when an hour and minute are reached, (ii) Monthly: when a day, hour and minute are reached, or (iii) Yearly: when a month, day, hour and minute are reached;
o Retrieve every X seconds;
o Retrieve only on a predefined set of dates.

Finally, the Routing element (Figure 4.19) enables specifying the load-balancing scheme, as defined in Section 4.2.2. If the Routing element is not defined, then round-robin load balancing will be applied, using as the pool of ETD processors all the ETD components defined in the Metadata Repository. If the element is defined, then specific load balancing will be used. For this purpose a set of ETD components must be selected, which will be used for forwarding the input files for processing. Multiple ETD engines may be defined, serving the data processing requests cyclically.

Figure 4.19: Routing element

4.4 ETD Engine
The ETD Engine is responsible for the actual ETD processing and has been implemented as a Web Service. Contrary to the FR engine, which is an active component that fetches data files, the ETD engine is a passive component that processes data files on request. After a successful download, if the Provided File metadata contains an FFD reference, the FR Engine sends to the ETD Engine the text file contents and the identifier of the FFD to be applied. Figure 4.20 presents a simple diagram displaying the ETD Engine input / output data flow. Each input file is processed in isolation from the remaining input files and may comprise multiple data sections. Each datum present in these sections is time referenced and corresponds to an event measured at a given moment in time. Thus, by sorting the date / time fields it is possible to establish the time frame to which the input file data refers.

Figure 4.20: ETD Engine input / output data flow


Every input file can be classified according to its data contents (i.e. real-time, summary or ad-hoc) and has one of three associated priorities: (i) high priority - real-time data: data regarding events that have occurred in the near past, e.g. within a 5-minute range; (ii) medium priority - summary data: data regarding past events, e.g. in a day / month range; (iii) low priority - ad-hoc data: historical data, processed to complete / correct data periods that were found missing or with incorrect values, respectively. This metadata is fundamental for defining the correct processing priority for each type of file. Such priority is used both by the ETD engine for processing the input files and when performing the Integration and Loading steps on the already processed data. Thus, it is fundamental that the priority metadata be propagated into the IL phase, through metadata, for each produced data delivery.
Figure 4.22 depicts the task pipeline that the ETD engine applies when processing every provided file. First the file is split into sections (e.g. header, disclaimer, data area) through a set of predicates defined in the ETD language explained in Chapter 5. Associated with each section there is a set of validation rules that enable the identification of possible changes to the file format. If an error is detected at any stage of the ETD pipeline, the file processing stops and the error is reported to the administrator (e.g. an e-mail is sent with the input file and FFD that raised the exception as attachments). If no sectioning violation is found, fields - single values and tabular - are extracted from the defined sections and validated according to their expected data types and minimum / maximum boundary values. If no data violation is detected and a missing value representation (e.g. -1, -9999) exists, missing values are replaced by a user-defined value. If no missing value representation is available this step is skipped. Next, transformations are applied. For each transformation, the data types of the inputs are first analysed and, only if valid, the transformation is executed. When all transformations have been performed, the Data Deliveries are executed one by one (a File Format Definition may contain multiple data delivery references, possibly for different types of data) and delivered to the Data Delivery Interface web service. For each delivery, if the data volume to be delivered is considerable (e.g. over 1000 data entries), the delivery is split into smaller data deliveries within this boundary value (Figure 4.21); otherwise a massive data delivery would not be scalable due to memory constraints.

Figure 4.21: Data Delivery package size
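The splitting of a large delivery into bounded packages can be sketched as follows; the generic row type and the class name are illustrative only.

import java.util.ArrayList;
import java.util.List;

// Splits a list of delivery rows into packages of at most maxEntries rows each.
public class DeliverySplitter {
    public static <T> List<List<T>> split(List<T> rows, int maxEntries) {
        List<List<T>> packages = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += maxEntries) {
            int end = Math.min(start + maxEntries, rows.size());
            packages.add(new ArrayList<>(rows.subList(start, end)));
        }
        return packages;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 2500; i++) rows.add(i);
        // With a boundary of 1000 entries, 2500 rows yield packages of 1000, 1000 and 500.
        split(rows, 1000).forEach(p -> System.out.println("package with " + p.size() + " entries"));
    }
}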


[Figure: activity diagram — Sectioning; if sectioning is valid, Field Extraction; if fields are valid, Apply Missing Value Representation; for each transformation, validate the inputs and Apply Transformation; when all transformations are applied, perform the Data Deliveries; any validation failure reports an error.]
Figure 4.22: ETD Engine tasks pipeline


4.5 Data Delivery Interface
After concluding the Extraction and Transformation steps for an input file, the ETD engine delivers all processed data in XML format to a generic Data Delivery interface. This interface is responsible for receiving the processed data and applying / forwarding it to the Integration logic, which is responsible for the construction of a unified view combining multiple data deliveries together (possibly from different data sources) that is afterwards Loaded into an application or data repository. Each data delivery has a unique identifier associated with it, corresponding to a serial number that is dynamically generated when the ETD Engine performs a delivery to the Data Delivery Interface. Deliveries follow a row-oriented format, where each row corresponds to an event measured in time. In each row, besides the timestamp and measurement value columns, other values may be associated with the entry (also represented as columns). Each column contains a textual metadata descriptor (i.e. a column header) that clearly identifies which values shall be available in that column. Further, an External Identifier (EID) may also be associated with a row entry. This identifier can be defined manually, imported from a file (e.g. a CSV file with an identifier value and a description) or imported from the Metadata Repository as metadata. In the two cases where the identifier is not entered manually for a data delivery, after performing the import action the user is asked to select which identifier(s) the data delivery refers to. A data delivery may refer to a single EID parameter that is kept constant over the EID Column (Table 4.1) or to multiple EID parameters (Table 4.2). In the latter case, the EID column values result from a mapping operation that replaces extracted / transformed data values by EID references.

Table 4.1: Data Delivery format for a single EID
EID Column | Column 1  | Column 2  | ... | Column N
EID 1      | Value 1,1 | Value 1,2 | ... | Value 1,N
EID 1      | Value 2,1 | Value 2,2 | ... | Value 2,N
...        | ...       | ...       | ... | ...
EID 1      | Value M,1 | Value M,2 | ... | Value M,N

Table 4.2: Data Delivery format for multiple EIDs (mapping)
EID Column | Column 1  | Column 2  | ... | Column N
EID 3      | Value 1,1 | Value 1,2 | ... | Value 1,N
EID 7      | Value 2,1 | Value 2,2 | ... | Value 2,N
...        | ...       | ...       | ... | ...
EID 2      | Value M,1 | Value M,2 | ... | Value M,N


The Data Delivery interface has been implemented as a web service that can be accessed remotely, exposing a single method:
public boolean deliverData (String delivery)
Although the deliverData web method receives a string value as its argument, this string must be a valid XML document according to the XML Schema depicted in Figure 4.23.

Figure 4.23: Generic Data Delivery schema
Each DataDelivery is composed of a Metadata part and a Data part. The Metadata part describes the data structure present in the data delivery:
o DeliverySerial: A unique serial number associated with the data delivery. This value can be used for traceability purposes, determining to which input file a data delivery belongs;
o DeliveryDate: A timestamp for when the data delivery occurred;
o DataNature: For data deliveries that refer to a common format, used multiple times in the creation of FFDs, it is possible to define data delivery templates. These templates (addressed in Chapter 5) hold the structural declaration of a data delivery for a specific class of parameters (e.g. SC Parameters, SW Parameters) following a common structure. Thus, a template metadata declaration is equivalent to an empty data delivery where all the column descriptors have already been instantiated. Having the DataNature metadata available at the Integration level, and considering that the template names are known by both the domain and computer-science experts, this field clearly identifies the type of structure that the data delivery follows, i.e. it is not necessary to inspect each column descriptor to infer the data delivery structure in the IL phase;
o ProcessingType: Classifies whether the data corresponds to a real-time, summary or ad-hoc data delivery;
o StartDate and EndDate (optional): Indicate the minimum and maximum dates to which the temporal parameter (EID) data refers, establishing a time frame;
o EIDs: A sequence with all the external identifiers involved in the data delivery;
o ColumnHeaders: A set of textual descriptors for the data columns to be delivered.
Data is delivered in a row-oriented format (rows versus columns), with the parameter values present in the Col elements, as depicted in Figure 4.24.
[Figure content (element values only): value1 value2 value3 value4]
Figure 4.24: Example of Data element contents
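A minimal sketch of what an implementation behind deliverData could do with the received string is shown below: validate it against the generic data delivery schema before handing it to the Integration logic. The file name DataDelivery.xsd and the forwardToIntegration method are assumptions for the example, not the actual interface code.

import java.io.File;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

// Receives the XML data delivery as a string, validates it and forwards it to IL.
public class DataDeliveryService {

    public boolean deliverData(String delivery) {
        try {
            Validator validator = SchemaFactory
                    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new File("DataDelivery.xsd"))
                    .newValidator();
            validator.validate(new StreamSource(new StringReader(delivery)));
            forwardToIntegration(delivery); // Integration and Loading happen downstream
            return true;
        } catch (Exception e) {
            // Invalid or malformed delivery: reject it so the caller can log the failure.
            return false;
        }
    }

    private void forwardToIntegration(String delivery) {
        // Placeholder for the IL-specific logic.
    }
}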

4.6 FFD Editor
The FFD Editor application enables the creation, debugging and testing of FFDs, as well as their submission to the supporting Metadata Repository for further use by the ETD engine. This graphical application provides a transparent mapping between the declarative XML-based FFD language and a set of graphical gestures and representations that enable a non-computer-science expert to specify ETD actions. Both the ETD engine and the FFD Editor share the same code for processing each provided file. In this way, the results obtained while using the FFD Editor application are the same as those obtained during the processing phase with the ETD engine. Since a close parallelism exists between the FFD Editor graphical application and its supporting FFD language (the core components of the proposed ETD approach), Chapter 5 is dedicated to these two technologies.


4.7 DPM Console
The DPM Console is a graphical application that enables metadata management and the monitoring and control of the FR and ETD engines. As depicted in Figure 4.25, the DPM Console interacts with the entire ETD pipeline, using the web methods made available by the FR and ETD web services for control, status and logging information.

[Figure: one DPM Console connected to an FR Engine and to several ETD Engines.]
Figure 4.25: DPM HMI interaction with FR and ETD engines
At application start (or whenever a new ETD Engine is defined in the application), the DPM Console subscribes to the logging information made available by the component (Figure 4.26).

[Figure: activity diagram — subscribe to logging information; while the DPM HMI is running, retrieve the latest logging information, update the logging component and wait 5 seconds.]
Figure 4.26: DPM HMI logging subscription mechanism
Information is acquired asynchronously (every 5 seconds by default) on invocation by the DPM Console, and this cycle is executed until an explicit termination occurs. A similar scheme is used for retrieving the status 14 and logging information present in the FR Engine component. This tool enables the visualization of the FR and ETD engine configurations, the modification of these configurations and metadata synchronization with the Metadata Repository. In a similar way, it is possible to define new Data Service Provider and Provided File metadata instances, as well as to edit them and flag them for removal. Figure 4.27 15 provides an example for the Lomnicky Data Service Provider [76].

Figure 4.27: DPM Console - Data Service Provider metadata
Using the DPM Console, it is possible to define the execution status (i.e. started / stopped) of each Data Service Provider, its related Provided Files and even the FR and ETD engines. Associated with each created FR or ETD engine there is a logging visual component (Figure 4.28) that displays the actions executing at the respective web service. In addition to an online view of the engine execution tasks, it is also possible to filter the logging data (e.g. by date intervals or data service provider).

14 Describes whether a Data Service Provider or Provided File is Active for data retrieval or not. Such information is mapped into specific icons and a colour scheme.
15 In the scope of the SESS system, ETD engines were known as FET (File Extractor and Transformer) engines.

Finally, the DPM Console component also enables the selection of provided files for ad-hoc loading. The user must select which provided files should be processed by the DPM, the FFD instance and its version, and the ETD component that will perform the data processing. The ETD component selection is quite important, since an ad-hoc loading may contain a considerable number of provided files to process. Having a dedicated ETD component for ad-hoc loading is therefore recommended, in order not to overload the real-time ETD components (if any).

Figure 4.28: DPM Console – logging area

4.8 Log Analyser
The Log Analyser is a graphical component for displaying and querying logging information. It can be used either as a stand-alone application or as a graphical component included in the DPM Console application. When executed as a stand-alone application, the user must select a previously created log file. As a consequence, the contents of the log file are loaded into the Data Section and the Tag, Who and Operations filters are refreshed with the values existing in the log file for those columns.

When incorporated in the DPM Console application, the component receives asynchronous log events from both the FR and ETD engines, which are shown to the user depending on the selected filter (if any) - Figure 4.28.


In either operating mode, both the toolbars and the filtering area presented in Figure 4.29 are made available to the user. A filter is composed of the conjunction (i.e. the and operator) of any number of predicates over six available fields (a sketch of such a conjunction follows Figure 4.29):
o Free Search: Searches for a string or regular expression in all the columns of the Data Section table;
o Start Date: Filters all log entries by the Time column (Start Date < Time Column);
o End Date: Filters all log entries by the Time column (End Date > Time Column);
o Tag: Filters all log entries by the Tag column;
o Who: Filters all log entries by the Who column;
o Operations: Filters all log entries by the Operations column.

Figure 4.29: Toolbars and filtering area
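The following sketch illustrates such a conjunction of filter predicates; the log-entry type and field names are hypothetical and not the actual component code.

import java.time.LocalDateTime;
import java.util.regex.Pattern;

// A filter accepts a log entry only if every defined predicate holds (and operator).
public class LogFilter {

    public static class LogEntry {
        public LocalDateTime time;
        public String tag;
        public String who;
        public String operation;
        public String text; // concatenation of all columns, used by the free search
    }

    private final Pattern freeSearch;      // string or regular expression, may be null
    private final LocalDateTime startDate; // may be null
    private final LocalDateTime endDate;   // may be null
    private final String tag;              // may be null
    private final String who;              // may be null
    private final String operation;        // may be null

    public LogFilter(Pattern freeSearch, LocalDateTime startDate, LocalDateTime endDate,
                     String tag, String who, String operation) {
        this.freeSearch = freeSearch;
        this.startDate = startDate;
        this.endDate = endDate;
        this.tag = tag;
        this.who = who;
        this.operation = operation;
    }

    public boolean accepts(LogEntry e) {
        return (freeSearch == null || freeSearch.matcher(e.text).find())
                && (startDate == null || e.time.isAfter(startDate))
                && (endDate == null || e.time.isBefore(endDate))
                && (tag == null || tag.equals(e.tag))
                && (who == null || who.equals(e.who))
                && (operation == null || operation.equals(e.operation));
    }
}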

4.9 Summary
This chapter presented the Data Processing solution as a whole, based on the proposed ETD+IL approach. Both the technologies used in the development and the architectural design have been addressed, with a special focus on technology and component interaction. Then, each individual DPM application was described in terms of its functionalities and main data flows. The next chapter (Chapter 5) is dedicated to the File Format Definition (FFD) language and the File Format Definition Editor (FFD Editor) graphical application. Although both technologies have been addressed in this chapter, they have not been presented in great detail and, due to their relevance for the thesis, they will be further explained. First, an abstract model for the FFD language is presented, followed by its operationalization using XML technologies. Next, the FFD Editor application is introduced, focusing on the human / machine interaction for defining the three ETD steps.


Chapter 5
The File Format Definition Language and Editor
The fifth chapter is dedicated to the File Format Definition (FFD) language and the File Format Definition Editor (FFD Editor) graphical application. First, an abstract model for the FFD language is presented, followed by a description of how the language has been implemented using XML-based technologies. Next, the FFD Editor application is introduced, starting with a general overview of the application's graphical organization, followed by an explanation of how the three ETD steps are instantiated seamlessly for the domain user. Due to its complexity (derived from the data normalization process), the graphical operations related to the Extract activity are explored in greater detail. Finally, some considerations are presented regarding the FFD language expressiveness and extensibility.


The ETD+IL solution relies on declarative assertions that identify the operations required for ETD processing. Thus, knowledge represented at a declarative level can be analysed and shared between domain experts. Further, the declarations can be automatically analysed to detect usage patterns, promoting language extension and evolution. For instance, the commonly used creation date pipeline "Year Field" appended to "-", appended to "Month Field", appended to "-", appended to "Day Field", appended to " ", appended to "Hour Field", appended to ":", appended to "Minute Field", appended to ":", appended to "Second Field", comprising 10 append transformations, has been replaced by a single transformation that receives all six date / time constituents as inputs and enables the customization of the date / time separators (e.g. "-", ":").
These declarative definitions are stored in ETD scripts named File Format Definitions (FFD) and are compliant with a pre-defined language. FFD contents are directly dependent on the text file format, such that a one-to-one association exists between a file format and an FFD. The definition of the FFD language followed two steps. First, an abstract model with the main functionalities supported by the language was defined in parallel with a semi-formal grammar. The construction of this grammar, although partial (mostly related to the Extract activity), allowed defining the main operators for the sectioning and field definition actions. Second, the language was rigorously implemented using XML technologies. XML Schema and XML have been selected for the definition of the declarative language and of FFD instances, respectively, since they are widely known W3C standards enabling a clear and non-ambiguous representation of domain knowledge.
Although the declarative FFD language presents itself closer to the domain user (at least when compared to the source code alternative), due to the XML technical details this solution is still not suitable for non-computer-science experts. An abstraction layer is required on top of the FFD language, masking all technical details and making them seamless to the user, who should be limited to performing gestures (i.e. graphical interaction) using a dedicated editor. The FFD Editor (FFDE) is a graphical application for the creation, editing, testing and debugging of FFDs. The creation of a new FFD is based on annotations over a sample text file, following four main phases (that may be iterated if required): the three ETD steps – Extraction, Transformation and Data Delivery – and a final Validation step.
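As an illustration of the consolidation mentioned above, a single parameterized transformation can assemble the timestamp that previously required ten append operations. The sketch below is merely illustrative; the class and method names are assumptions and not part of the FFD transformation library.

// Hypothetical sketch: builds a "YYYY-MM-DD HH:MM:SS" style timestamp from six
// extracted fields, with the date and time separators passed in as parameters.
class BuildDateTime {
    static String build(String year, String month, String day,
                        String hour, String minute, String second,
                        String dateSep, String timeSep) {
        return year + dateSep + month + dateSep + day + " "
             + hour + timeSep + minute + timeSep + second;
    }

    public static void main(String[] args) {
        // e.g. build("2007", "06", "21", "12", "30", "00", "-", ":") -> "2007-06-21 12:30:00"
        System.out.println(build("2007", "06", "21", "12", "30", "00", "-", ":"));
    }
}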

5.1 The File Format Definition Language
The FFD language that supports the ETD approach has been non-ambiguously represented using the XML Schema technology. For each class of input files, an XML FFD instance is created, holding the ETD instructions to be applied to the data. Each XML instance must conform to the XML Schema definition of the language, in order to be correctly interpreted by the ETD Engine. The language definition followed a two-level specification. First, a high-level model was derived that identifies the main functionalities and tasks (as well as their ordering) for ETD, while abstracting implementation details.


Next, the language was represented using XML Schema. This definition was performed iteratively and was continuously refined throughout the implementation of the FFD Editor application. During this period, all the knowledge that could be customized by the user was placed in the FFD and not hard-coded at the source code level. The proposed language was inspired by a previous declarative approach to ETL [42], also based on XML technologies. Taking this language as a starting point, it was refined and expanded according to two factors 16: (i) operationalization with the FFD Editor and (ii) addition of new features. Regarding the language operationalization, features like Processing Priorities and the file Sample used in the graphical specification have been included. Multiple new features have been added: e.g. data quality mechanisms (like data typing and validation rules) as well as all the Data Delivery logic (specific to this thesis) that replaced the previous traditional Loading logic.

5.1.1 Model
The FFD model provides a first sketch for the language definition and is depicted in Figure 5.1. The model primarily highlights the division into the three ETD activities, where the outputs of each step (e.g. Extraction) are taken as inputs for the following step (e.g. Transformation). In the case of the Data Delivery activity, the data that can be considered as input is not limited to the previous step (i.e. Transformation) but can be expanded with references to extracted fields resulting from the Extract step.

Figure 5.1: The File Format Definition model
Initially, given a sample Input File, the user specifies a set of Section Definitions. As a result of applying the Section Definitions to the Input File, a Sectioned File is obtained. Since Field Definitions can only be defined in the context of a single section, applying the Field Definitions to the input Sectioned File results in a Field Partitioned File. At this point all the Extraction operations have been applied to the initial Input File.

16 A complete reference of the FFD language is presented in the following sections.

Transformation Definitions are applied to the Field Partitioned File (output of the Extraction operations), resulting in a Transformed File. Finally, Data Delivery Definitions are applied to the outputs of both the Extraction and Transformation steps, namely the Field Partitioned File and the Transformed File, respectively, resulting in a Data Delivery File. This file constitutes the final output of the whole ETD process.
Figure 5.2 presents some general assumptions followed in the creation of an FFD 17. The first three statements refer to the definition of a file as a non-empty sequence of semi-structured text, which is in turn defined as a non-empty sequence of lines. A line is a simple non-empty sequence of ASCII characters. The remaining statements are related to the sectioning procedure. At least one section definition must be established for each input file, and the section's start and end boundaries must refer to line numbers within the file length. For all defined sections, the start boundary line number must always be lower than the end boundary line number.

File = non-empty sequence of Semi-Structured Text
Semi-Structured Text = non-empty sequence of Lines
Line = non-empty sequence of ASCII characters
Sections Definition = non-empty sequence of Section Definition
For each Section defined in Sectioned File: Begin Line > 0 AND End Line > 0 AND Begin Line < End Line
Sectioning Algorithm (Section Definitions, File) → Sectioned File

Figure 5.2: General assumptions
Due to the complexity of the Extract phase (where data can be de-normalized and presented in multiple ways), this activity was split in two, following a common divide-and-conquer approach: sectioning and field definition. Input files commonly have more than one area / part, with different contents and / or organization. Classical examples of such areas are: comments, disclaimer, footnote, metadata (e.g. about the file creation or about the data itself) and data areas (where data can be presented as a table or following any ad-hoc proprietary format). Depending on each domain, these areas may be optional, repeated, appear in any order and have domain-variant contents. These areas (i.e. sections) are easily identified by a domain expert and usually follow some kind of rule (e.g. a metadata section that starts in line 3 and ends in line 7) or can be identified graphically (e.g. a comment section comprises all the lines that start with a "*" character).
Analysing the section definition statements presented in Figure 5.3, both the start and end section boundaries can be defined in one of four ways.

17 The statements presented in this section represent a simple first approach for identifying the main conditions, definitions and algorithms required for the sectioning task. Although represented in a non-ambiguous way, no formal language and / or syntax rules have been followed in this specification.

The start boundary line can be defined according to the file start (line number = 1), to a line number comprised within the file length (line number = N), to a line where a specific pattern is found (line number = Pattern (line)) or after another section's end (line number = End Line (S - 1) + 1). In a similar way, the end boundary line can be defined according to the file end (line number = Size(File)), to a line number comprised within the file (line number = N), to a line where a specific pattern is found (line number = Pattern (line)) or before another section's start (line number = Begin Line (S + 1) – 1).

Section Definition (S)
Begin Line such that Position (Line) is determined by:
  begin of file: Position (Line) = 1
  integer value: Position (Line) = n
  string holding a pattern: Position (Line) = Pattern (Line)
  previous section end: Position (Line) = End Line (S – 1) + 1
End Line such that Position (Line) is determined by:
  end of file: Position (Line) = Size (File)
  integer value: Position (Line) = n
  string holding a pattern: Position (Line) = Pattern (Line)
  next section start: Position (Line) = Begin Line (S + 1) - 1

Figure 5.3: Section definition
Finally, Figure 5.4 presents a simplified version of the sectioning algorithm, used for establishing the values of the start and end boundary lines for each section. First, all defined sections are iterated and, for the boundary delimiters that have been defined through an absolute position (i.e. begin / end of file, specific line number or pattern related), the line number for each delimiter is established. Then, the same process is repeated for those boundary delimiters that have been defined via a relative condition (i.e. before the next section start or after the previous section end). This way, independently of the type of boundary condition, both start and end delimiters end up defined according to a specific line number.

For each Section Definition in Sections Definition
  If (Begin Line is absolute position) then
    Mark Start Section Line in Sectioned File with Section Definition Begin Line
  If (End Line is absolute position) then
    Mark End Section Line in Sectioned File with Section Definition End Line
For each Section Definition index in Sections Definition
  If (Begin Line is a relative position) then
    Mark Start Section Line in Sectioned File with Section [index - 1] End Line + 1 in Sectioned File
  If (End Line is a relative position) then
    Mark End Section Line in Sectioned File with Section [index + 1] Start Line – 1 in Sectioned File

Figure 5.4: Sectioning algorithm
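A minimal Java sketch of this two-pass procedure is given below. The SectionDef type and its field names are illustrative assumptions only; absolute boundaries are assumed to have already been resolved to line numbers (e.g. by evaluating patterns against the file).

// Hypothetical sketch of the two-pass sectioning procedure of Figure 5.4:
// absolute boundaries are marked first, relative boundaries are then derived
// from the neighbouring sections.
class Sectioning {
    static class SectionDef {
        Integer absoluteBegin, absoluteEnd;   // null when the boundary is relative
        Integer beginLine, endLine;           // resolved boundary line numbers
    }

    static void resolve(java.util.List<SectionDef> sections) {
        // First pass: mark all absolutely defined boundaries.
        for (SectionDef s : sections) {
            if (s.absoluteBegin != null) s.beginLine = s.absoluteBegin;
            if (s.absoluteEnd != null) s.endLine = s.absoluteEnd;
        }
        // Second pass: resolve relative boundaries from the neighbouring sections,
        // assuming (as in Figure 5.4) that the neighbouring boundary is absolute.
        for (int i = 0; i < sections.size(); i++) {
            SectionDef s = sections.get(i);
            if (s.beginLine == null) s.beginLine = sections.get(i - 1).endLine + 1;
            if (s.endLine == null) s.endLine = sections.get(i + 1).beginLine - 1;
        }
    }
}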

Once all the sections with relevant data or metadata have been identified, the specification of data fields follows. Fields are created within the scope of a single section and can be considered local to it. At this step it was found that two types of fields are required:
o Single value: Captures a single datum (e.g. the author's name, a timestamp);
o Table: A bi-dimensional matrix that captures multiple columns of data (e.g. the temperature of all European cities for all the days of a given month).


Once the field definition step is concluded, the Extract activity is considered completed and the field outputs can be used in the Transformation process. Although the Transformation step is optional, it is usually required, since data is rarely ready for direct data delivery. Depending on data specificities, transformations may be organized in one or multiple transformation sequences (i.e. pipelines) where the outputs of one transformation are fed as inputs to the next transformation, and so on. Finally, the structure in which data shall be delivered is defined, together with the identification of the data relevant for the delivery. Two types of data can be used for defining a data delivery: (i) extracted field data (e.g. single value, table column) and (ii) transformation outputs. At the model definition level, no specificities have been addressed regarding the actual implementation of each ETD step. These will be presented in the next section.

5.1.2 XML Schema Implementation
This section presents the FFD language operationalization using the XML Schema technology 18. Figure 5.5 presents the root node of the FFD language with its direct child elements. Elements marked with a dashed frame (e.g. Fields, Transformations, DataDeliveries) are considered optional 19, i.e. they may not be present in the created FFD instances, while the remaining elements are mandatory. If an optional XML element is not instantiated, the entire element hierarchy below it (if any) is also not instantiated.
As previously explained in Section 4.3.1, the identificationElementsGroup and documentationElementsGroup elements are required by the Metadata Repository application, which is suggested to support the ETD approach presented in this thesis 20. While the identificationElementsGroup refers to elements that uniquely identify an instance and provide a description for it (e.g. Name, ShortName), the documentationElementsGroup contains information regarding the creation of the metadata instance (e.g. Author, CreationDate, Status and Comments).
18 The complete FFD specification is available at http://raminhos.ricardo.googlepages.com/FFD.zip
19 Fields, Transformations and DataDeliveries are considered optional since the FFD Editor may create partial FFDs (without defining all ETD steps) to be completed at a later stage. In all these cases the FFD instance shall be compliant with the FFD language.
20 Otherwise, if the Metadata Repository is not considered as part of the ETD solution, these two elements can be removed from the FFD language definition.
The Versioning element contains the time span for which the FFD instance is valid. Since the structure of an input file may vary in time, multiple FFDs may have to be created, one for each different input file structure. The Versioning element contains a mandatory StartDate element that identifies the start date from which the FFD is valid and an optional EndDate element defining the end date of the FFD validity period. If the EndDate element is not instantiated, then that FFD corresponds to the last known input file format (possibly used for data processing at the current time).
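To make these versioning semantics concrete, the sketch below selects, for a given file date, an FFD version whose validity period covers it; an absent EndDate is treated as open-ended. The types and method names are assumptions for illustration only and not part of the FFD implementation.

import java.time.LocalDate;
import java.util.List;
import java.util.Optional;

// Hypothetical FFD version descriptor: EndDate may be absent (open-ended validity).
record FfdVersion(String name, LocalDate startDate, Optional<LocalDate> endDate) {
    boolean covers(LocalDate fileDate) {
        return !fileDate.isBefore(startDate)
            && endDate.map(end -> !fileDate.isAfter(end)).orElse(true);
    }
}

class FfdSelector {
    // Returns the first FFD version whose validity period contains the file date.
    static Optional<FfdVersion> select(List<FfdVersion> versions, LocalDate fileDate) {
        return versions.stream().filter(v -> v.covers(fileDate)).findFirst();
    }
}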

Figure 5.5: The FileFormatDefinition root element
The General element contains global metadata settings that are applicable to the entire FFD and comprises two elements: TestedWithMultipleInputs and MissingValueRepresentation. The TestedWithMultipleInputs element acts as a Boolean flag and identifies if the FFD generality has been tested with multiple input files, validating the specification initially performed on a single sample file (whose contents are also available in the FFD under the Sample element). This way the specification is tested to determine if it is general enough or if it is overfitted to the Sample file specifics. The MissingValueRepresentation element contains a string value that is used for representing missing values when delivering processed data to the IL phase.
The first step for defining the Extraction procedures is to define the existing sections in the input file (as previously discussed in the FFD Model section). Two types of sections can be used in this specification: contiguous and delimited.
Contiguous sections are defined as a set of contiguous lines that share a common pattern condition (e.g. Starting Pattern, Containing Pattern or Ending Pattern), where Pattern can be understood as one or more characters (e.g. "!", "*_") or a regular expression (e.g. "\d{4}\s+").
Delimited sections (Figure 5.6) are defined by the lines enclosed between the section start delimiter and the section end delimiter. Each section delimiter can be defined using three types of conditions (making a maximum of nine possible combinations for the definition of a delimited section). Independently of the definition chosen by the user, the start delimiter must always appear before the end delimiter. Also, between the two delimiters, at least one line must be defined (empty sections are not allowed).

Figure 5.6: The Delimited element
A start delimiter can be defined as:
o Relative: To the file start or to a previous section end;
o Line Number: Starting at a specific line number;
o By Content: Starting at the first line that matches a Starting with / Contains / Ending with textual pattern.
An end delimiter can be defined as:
o Relative: To the file end or to a next section start;
o Line Number: Starting at a specific line number;
o By Content: Starting at the first line that matches a Starting with / Contains / Ending with textual pattern.

Once sections are defined, the definition of fields within these sections, through the Fields element (a sequence of Field elements), should follow. Each Field is characterized by a unique Name and is associated to a section through a SectionIndex element. Also common to each field definition is the starting line within the section text. This line number (offset) indicates the first line of the section text from which data is considered in the definition of the field contents.
A SingleValue (Figure 5.7) can be defined in one of two ways:
o Delimiters: The user specifies prefix and / or suffix strings that will act as boundaries for determining the single field value;
o RegularExpression: The user specifies a regular expression that captures the single field. If the regular expression captures more than one value, only the first result shall be considered.

Figure 5.7: The SingleValue element
Single values (and table columns) may have an optional MissingValue representation. During processing, missing values are replaced by a custom value defined by the user (e.g. "" - empty string), indicating that the value is missing (e.g. due to a sensor failure).
The definition logic of a Table field is depicted in Figure 5.8. A table field can be defined in one of three ways:
o FixedWidth: Defines a set of column breaks (numeric positions in the text) that specify the boundaries for each column. A minimum of one column break must be defined (resulting in a table with two columns). No maximum number of column breaks is established, so N column breaks can be defined (resulting in a table with N + 1 columns);
o CharDelimited: Defines a delimiter character that specifies the boundaries for each column;
o RegularExpression: Specifies a regular expression that captures all table columns. Each group 21 in the regular expression is mapped to a table column.

21 Defined by a regular expression statement between brackets.

Examples:
o 1 group = 1 table column: ..(\d{2})
o 2 groups = 2 table columns: ..(\d{2})abc(.{3})
o 3 groups = 3 table columns: (\d{2})(\d{3})\s(\d{4})

Each table column has a label description that is represented by a pair (ColumnIndex, Description). Both table and single value fields share the same data typing / validation scheme. For table fields, multiple data typing and validation definitions are required (one for each table column), while for single values a single reference is enough. Associated with each data type there is a set of validation rules (depending directly on the data type). Figure 5.9 presents the Validation definition, expanding the four possible data types and displaying the associated rules.

Figure 5.8: The Table element
One of four data type values is always associated to each single value or table column:
o Unbounded: No data type definition;
o String: Textual values (Minimum Length and Maximum Length validation rules);
o Date: Date values (Minimum Date and Maximum Date validation rules);
o Numeric: Numeric values (Minimum Value and Maximum Value validation rules).



Figure 5.9: The Validation element
The Transformations node is comprised of a sequence of one or more Transformation elements, forming one or more Transformation pipelines. Each transformation is defined according to the schema presented in Figure 5.10, containing only the logical part of a transformation (the graphical representation is stored in the GraphicalDisplay element that will be presented later in this section). A list with all the available transformations is available in the Annex Available Transformation Operations. Since transformations have been implemented as plugins (explained in the next section), the Transformation node schema follows a generic representation:
o Type: The type of transformation (e.g. AppendConstant, Merge);
o Name: A unique name, defined by the user, that identifies the transformation;
o Inputs: A set of transformation inputs (if any). Each input is defined through a pair (Name, Reference), where the Name identifies the input contents and the Reference identifies the extracted field / transformation output containing the data;
o Parameters: A set of references for the transformation parameters (if any). Each reference is defined through a pair (Name, Value), where the Name uniquely identifies a parameter (equivalent to the value of the Parameter name attribute used in the definition of a plugin metadata transformation - Figure 5.16);
o Outputs: A set of transformation outputs. Each output is defined through a triplet (Name, Reference, External), where the Name identifies the output contents and the Reference identifies the transformation output containing the transformed data. The External element (Boolean flag) identifies which output columns can be handled for data delivery purposes. Marking a transformation output as not visible (default value) makes the output handling local to the Transformation context only (e.g. as input of another transformation). In this case the value is not propagated to the Data Delivery context and is not directly part of any data delivery. Only the External elements marked as true can be handled at the Data Delivery context.

Figure 5.10: The Transformation element
The final Data Delivery step for ETD is defined by the DataDeliveries node, which contains a set of DataDelivery elements. Each DataDelivery is defined by three elements: a Template, an Identifier and a Data element (Figure 5.11).

Figure 5.11: The DataDelivery element


Figure 5.12 presents the Template element definition. Each template has a user-defined name and refers to a delivery format that has been previously agreed between domain experts (ETD knowledge) and computer science experts (IL knowledge), and is identified by the TemplateGroup element (e.g. SW Parameters, Volcano Events). A template also defines the names of the data delivery columns and identifies the data type for each column. Furthermore, if data is time-oriented, then the DateColumnIndex identifies the index within the DeliveryColumns that contains the date information to be delivered 22.

Figure 5.12: The Template element
Figure 5.13 presents the Identifier XML Schema for a data delivery definition. An identifier can be specified in one of two ways:
o By an identifier reference that is kept constant in all lines of a data delivery (exemplified in Table 4.1);
o By a mapping function (Figure 5.14) that defines the identifier of a data delivery row depending on the SourceValue present in a specified column (e.g. mapping the parameter value N134XR to the global identifier GID_001_000000001). Using the mapping function, the same data delivery may contain multiple identifiers (exemplified in Table 4.2).

Figure 5.13: The Identifier element
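The mapping-based identifier can be thought of as a lookup applied per delivery row, as in the hypothetical sketch below; the example values simply mirror the N134XR / GID_001_000000001 mapping mentioned above, and the method name is an assumption.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: resolves the identifier of each data delivery row from the
// SourceValue found in a chosen column, using a mapping table.
class MappingIdentifier {
    static List<String> resolve(List<List<String>> rows, int sourceColumn,
                                Map<String, String> mapping) {
        return rows.stream()
                   .map(row -> mapping.get(row.get(sourceColumn)))
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> mapping = Map.of("N134XR", "GID_001_000000001");
        List<List<String>> rows = List.of(List.of("N134XR", "12.3"));
        System.out.println(resolve(rows, 0, mapping)); // [GID_001_000000001]
    }
}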

22 If containing a valid index, then the date contents of the referred column will be used to identify the StartDate (minimum date value) and EndDate (maximum date value) when performing an actual XML data delivery (Figure 4.23).

The Data element simply comprises a set of references to column names (either referring to extracted fields or to transformation outputs), which will form the row-oriented data delivery. The Processing element contains information about the Priority (i.e. Low Priority, Nominal Priority and High Priority) used in data processing, as well as the ProcessingType (i.e. Realtime, Summary, Ad-hoc). According to these two parameters, the thread priority used for processing is calculated, where the ProcessingType value is mapped to an offset value added to the base priority.

Figure 5.14: The MappingIdentifier element
An example of the thread priority definition process is presented in Table 5.1.
Table 5.1: Thread priority definition example
Priority (Base Priority) | Processing Type (Offset Priority) | Thread Priority (Base + Offset Priority)
Low Priority (1)         | Ad-Hoc (0)                        | 1
Low Priority (1)         | Summary (1)                       | 2
Low Priority (1)         | Realtime (2)                      | 3
Nominal Priority (4)     | Ad-Hoc (0)                        | 4
Nominal Priority (4)     | Summary (1)                       | 5
Nominal Priority (4)     | Realtime (2)                      | 6
High Priority (7)        | Ad-Hoc (0)                        | 7
High Priority (7)        | Summary (1)                       | 8
High Priority (7)        | Realtime (2)                      | 9
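A direct transcription of this rule could look like the sketch below, using the numeric base and offset values from Table 5.1; the enum and method names are illustrative assumptions only.

// Hypothetical sketch of the thread priority rule: base priority plus a
// processing-type offset, with the numeric values taken from Table 5.1.
class ThreadPriority {
    enum Priority { LOW(1), NOMINAL(4), HIGH(7); final int base; Priority(int b) { base = b; } }
    enum ProcessingType { AD_HOC(0), SUMMARY(1), REALTIME(2); final int offset; ProcessingType(int o) { offset = o; } }

    static int compute(Priority priority, ProcessingType type) {
        return priority.base + type.offset;
    }

    public static void main(String[] args) {
        System.out.println(compute(Priority.NOMINAL, ProcessingType.REALTIME)); // 6
    }
}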

Finally, the GraphicalDisplay element (Figure 5.15) contains graphical metadata used by the FFD Editor for storing user preferences (regarding transformations’ display and organization). Each transformation is represented graphically as a node in a graph, containing a textual string in the node (i.e. Name), a position in the graph given by the XCoordinate and YCoordinate positions, a Width and a Height.



Figure 5.15: The GraphicalDisplay element

5.1.3 Transformation Plugins
Since transformations are highly dependent on the nature / format of the input files' data, the mechanism for adding new transformations to the FFD language (and Editor application) uses a plugin philosophy, easing the introduction of new transformations. In this way, all transformation code (either logical or graphical) and metadata are completely decoupled. A transformation plugin basically consists of three files: one Java class containing the logic of the transformation, one Java class containing the graphical visualization for user interaction and a metadata file describing the transformation itself. Figure 5.16 presents the XML Schema followed by all transformation metadata files. Each transformation is characterized by the following information:
o Name: A value that uniquely identifies a transformation (i.e. primary key). In an FFD instance, the value present in the Type element of a Transformation (Figure 5.10) is a reference to this value (i.e. foreign key);
o Scope: Indicates whether the transformation scope is restricted to a single column (the one being transformed) or covers multiple table columns;
o Description: A description used as tool tip text;
o Colour: A background colour (given by the three Red / Green / Blue – RGB components);
o IconPath: Path to the image representative of the transformation (used as icon);
o Inputs: The name and data type (i.e. Integer, String, Date, Unbounded, Double) of each transformation input (if any);
o Parameters: The name and data type of each transformation parameter (if any);
o Outputs: The name and data type of each transformation output;
o LogicalClassName: Full package name of the Java class containing the logic of the transformation (for dynamic class loading);
o GraphicalClassName: Full package name of the Java class containing the graphical / user interaction for the transformation (for dynamic class loading).

Figure 5.16: Transformation definition
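A possible shape for such a plugin contract and for the dynamic loading driven by the LogicalClassName element is sketched below; the interface and class names are assumptions for illustration only and do not reproduce the actual DPM code.

import java.util.List;
import java.util.Map;

// Hypothetical plugin contract for the logical part of a transformation.
interface TransformationLogic {
    // Applies the transformation to its inputs, given the configured parameters.
    Map<String, List<String>> apply(Map<String, List<String>> inputs, Map<String, String> parameters);
}

class PluginLoader {
    // Dynamically loads the class referenced by the LogicalClassName metadata element.
    static TransformationLogic load(String logicalClassName) throws Exception {
        Class<?> clazz = Class.forName(logicalClassName);
        return (TransformationLogic) clazz.getDeclaredConstructor().newInstance();
    }
}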

5.2 The FFD Editor
Although the FFD declarative approach is one step forward in the representation of domain knowledge (when compared to the creation of source code), it is still not a feasible solution to be used by a domain user. Instead, this thesis proposes that the FFD language should be used from two perspectives: (i) as a way to strictly represent the main objects, relations, constructs and operations required for processing semi-structured textual data; (ii) by representing the data processing knowledge using XML technologies, so that a technical operationalization of the language is attained (a knowledge container) that can be automatically interpreted by external programs.
The FFD Editor is a graphical application that enables the construction and editing of FFD metadata files. The editor does not work directly over the FFD declaration but over a semi-structured text file sample, representative of a class of files. Through live editing of data based on a set of templates, wizards and graphical gestures performed by the user, the FFD specification is created seamlessly. All the XML specificities are hidden from the domain user, who does not have any direct contact with the FFD specification. Besides masking the specificities of the supporting language, the FFD Editor also hides the complexity of data processing tasks, providing simple graphical interfaces from which the data processing statements are automatically derived (e.g. the automatic generation of regular expressions based on graphical gestures and wizards).
The FFD Editor's graphical layout is mainly composed of three tabs, one for each step of ETD: Extract, Transform and Data Delivery (Figure 5.17). Each FFD is based on a sample file where the user specifies which data shall be extracted, which transformations are required and what information (and its structure) is relevant for delivery. Upon user specification, the operations are automatically applied to the sample file and visualized by the user. The user can then perform any further modifications as required. All three ETD steps / tabs are highly correlated (working as a functional pipeline): the Extract outputs can be used in Transform and Data Delivery, and the Transform outputs can be used in Data Delivery.

Figure 5.17: The FFD Editor ETD tabs
For all three tabs, the same graphical layout was followed, as depicted in Figure 5.18.

Figure 5.18: A graphical layout

Each tab is mainly comprised of three areas (with minor changes specific to each ETD step):
o Previous Objects (identified as A): Objects from previous ETD steps (e.g. an extracted field in the Transform tab);
o Current Objects (identified as B): Objects being created on the current ETD step (e.g. a section or extracted field in the Extract tab);
o User Interaction (identified as C): Main user interaction area (e.g. marking sections or identifying fields in the Extract tab, a transformation pipeline in the Transform tab or the data delivery composer area in the Data Delivery tab).

Figure 5.19, Figure 5.20 and Figure 5.21 present the instantiation of the graphical layout for the Extract, Transform and Data Delivery tabs, respectively 23. The Extract tab is composed of two areas: (B) the area where sections and fields are represented as a tree and (C) the Extract panel, where the user can draw sections and specify fields within them by interacting with the sample file. The Previous Objects area (A) is not available, since the Extract tab is the first step in the ETD chain.

Figure 5.19: Extract tab layout
The Transform tab is composed of three types of areas: (A) presenting the sections and fields previously defined in the Extract tab; (B) displaying the transformations and visible transformation outputs (defined in the Transform tab) that will be visible in the Data Delivery tab; (C) the main Transform panel (on the right), where transformations are depicted as a graph, and the transform definition panel (on the lower left), where the domain user can instantiate the arguments required for defining a transformation.
23 Further details regarding each tab are presented in the following sub-sections.
Finally, the Data Delivery tab is also composed of three areas: (A) presenting the sections and fields defined in the Extract tab and the visible transformation outputs defined in the Transform tab; (B) displaying the data deliveries created in the current tab; (C) the main Data Delivery panel, where the user composes and selects which data is relevant for defining a data delivery.
In order to provide better extensibility, data transformations have been implemented following a plug-in architecture, as explained in Section 5.1.3. Based on the metadata files describing each available transformation, the toolbar of the Transform tab (Figure 5.20) is automatically populated.

Figure 5.20: Transform tab layout
Java classes describing the transformations' graphical layout are also dynamically loaded into the Transform tab at the (C) panel (on the lower left) and displayed as transformations are selected. The Java classes describing the transformations' logic are loaded dynamically and applied by the ETD engine during processing. The FFD Editor comprises an ETD Engine for simulating the execution of the ETD statements defined in the graphical application. This engine is the same used by the ETD Web Service. This way the same behaviour is guaranteed when processing a file either through the FFD Editor or the ETD Web Service.


The FFD Editor interacts with the Metadata Repository as the target and source for deploying and loading FFD metadata, respectively. This interaction is highly decoupled and the two components can be used independently. While in offline mode (without connectivity to the Metadata Repository), the FFD Editor can create / edit FFD metadata, which can be synchronized later with the Metadata Repository when it becomes accessible.

Figure 5.21: Data Delivery tab layout
A complete real example of a step-by-step FFD creation is included on the CD that accompanies this report (due to its length, the example is not available in this document).

5.2.1 Menu Functionalities and General Metadata
Besides the three ETD tabs, the menus File, Options and Actions are also available to the user. The File menu (Figure 5.22) enables general FFD load / save capabilities:

Figure 5.22: FFD File menu

o Open FFD: Opens a previously created FFD. The FFD can be loaded directly from the File System or by connecting to the Metadata Repository directly. Depending on the user choice, an open file dialogue window or a Select FFD to load frame (Figure 5.23) will be presented to the user, respectively;
Figure 5.23: Open FFD… options
o New FFD: Enables the creation of a new FFD. The first step when defining a new FFD consists in selecting a sample file, which can be selected from the File System or from the File Retriever's Cache file path;
o Edit Metadata: Edits the general metadata (i.e. identification, authoring, versioning and data processing) directly associated to the FFD;
o Deploy: Stores the current FFD in the Metadata Repository. Depending on the user selection, a new version of the FFD can be created or the last FFD version can be overwritten;
o Save / Save As: Saves the current FFD as an XML file into the File System;
o Exit: Terminates the FFD Editor application.

The Options menu manages the connection between the FFD Editor application and the Metadata Repository (i.e. definition of IP and port number). The Actions menu enables FFD testing applied to multiple input files (a test set). This way, the testing of the FFD definition is not limited to the sample file used in the FFD creation, and its generality can therefore be assessed.
After choosing the sample input file, the Extract panel is instantiated with the input file contents and the File Format Definition Metadata panel is presented to the user (Figure 5.24). The user must then enter metadata regarding:
o Identification: FFD unique name and description;
o General: A textual value for representing missing values when delivering the processed data to the IL phase;
o Authoring: The author's name and the status 24 of the FFD metadata instance (i.e. Proposal, Accepted, Rejected);
o Versioning: The time frame for which the FFD is valid for data processing (i.e. input files must refer to this period);
o Data Processing: Defines the base priority for data processing (i.e. Low Priority, Nominal Priority and High Priority) and the type of data present in the file (i.e. Realtime, Summary or Ad-hoc).

Figure 5.24: File Format Definition Metadata form

5.2.2 Extraction
The Extraction tab comprises the first step in the ETD chain and is the first contact between the domain expert and the input file. Depending on each input file format, a higher or lower degree of effort may be required to normalize the data into sections and fields, which will be handled afterwards in the Transformation and Data Delivery steps.

24 The status element is a requirement for managing metadata within the Metadata Repository application. Since some metadata instances require a high level of domain expertise (e.g. physical measurements in the space environment), double-checking may be required. Thus, when created, metadata is instantiated by default as a Proposal and, once re-checked, it can be either Accepted or Rejected.


5.2.2.1 Sectioning
The first step in data extraction consists of partitioning the text file into non-overlapping sections that identify different areas of data within the text file. Generally, these sections are easily identified since they share some common property (e.g. all lines starting with a given prefix or following a start / end delimiter condition). Each section is identified by a name and can either be delimited (where two boundary conditions determine the beginning and end of the section) or contiguous (defined by a common property shared by a contiguous set of text lines). Delimited sections can be defined through absolute conditions such as file start, file end, line number or the first line that starts with, contains or ends with a given string pattern. Besides absolute conditions, delimited sections can also be defined using relative conditions such as start section after previous section end or end section before next section start. Contiguous sections are defined through a sequence of lines that start with, contain or end with a given string pattern.
Sectioning is performed either through graphical interaction with the sample file or via specific wizards (e.g. for defining pattern-based sections). In the wizard the user may define a set of validation rules that sections must comply with in order to be considered valid: whether the section is optional, the minimum and / or maximum number of lines present in the section, and the existence of a given pattern at the section start, middle or end. Through this validation mechanism, changes in the file's format may be detected early, at the sectioning phase.
All file sections are initially created by a default section creation gesture (Figure 5.25), where the user selects a set of text lines that are marked as a section (if no overlap with another section occurs). By dragging the section boundaries interactively, it is possible to link a section boundary either to a file start / end condition or to a section start after previous section end / section end before next section start relative condition.

Figure 5.25: Marking a text area for section creation
A default section is delimited by nature and has its start and end delimiters following a line number condition, based on the first and last line selected by the user. However, if the start line corresponds to the start of the file or the end line corresponds to the end of the file, then this notation takes precedence over the line number condition (Figure 5.26). Besides being visible in the file content pane, the new section will appear in the FFD tree (under the Sections node). The node icon presented for each section identifies whether the section is Delimited or Contiguous. Whenever a section node is created or selected in the FFD tree, the corresponding section is highlighted (light-blue) in the file content pane. Further, depending on the node type selected in the FFD tree (i.e. delimited section, contiguous section, table field, table column field, or single value field), descriptive metadata information is presented in the lower-left panel of the Extract tab.


Figure 5.26: Default section creation
For each sectioning type, two graphical symbols (arrow icons 25) represent the semantics of the section delimiters:
o Arrow pointing left: Indexed by line number;
o Arrow pointing right: Indexed by string pattern (i.e. Starts With, Containing, Ends With);
o Arrow pointing up: Relative to previous section end;
o Arrow pointing up with a line above: Relative to file start;
o Arrow pointing down: Relative to next section start;
o Arrow pointing down with a line below: Relative to file end.

Each section delimiter can be mouse-dragged by left-clicking the section delimiter (that becomes red) and dragging it into a new location (Figure 5.27). In this process, the previous delimiter condition is lost (e.g. start section using a starting with pattern) and the section delimiter becomes indexed to line number.

25 The top rightmost arrow icon corresponds to the start section delimiter, while the bottom leftmost arrow icon corresponds to the end section delimiter.


Figure 5.27: Interacting with section delimiters
Depending on the section delimiter condition, right-clicking on it may trigger a popup menu in order to change the delimiter condition. Figure 5.28 depicts a section (Second Section) that starts when another section (First Section) ends.

Figure 5.28: Transforming a line oriented delimiter into relative
Right-clicking on the start delimiter of the Second Section will make a popup menu appear, enabling a change from a line number oriented delimiter to one relative to the previous section end. Selecting this change, the section is updated as depicted in Figure 5.29.

Figure 5.29: A relative to previous section start delimiter
Dragging a non-line based (e.g. relative) section delimiter will make the section delimiter line oriented (Figure 5.30). In order to set a By Content condition on a section delimiter, the user must mark the text line that contains the pattern used in the condition. After selecting the text and right-clicking on it, a popup appears.

Figure 5.30: Transforming a relative delimiter into line oriented

The user must then select whether the text shall be used for defining the start or end delimiter and whether the pattern condition will be used for establishing a delimited (Figure 5.31) or contiguous section (Figure 5.32).

Figure 5.31: Defining a section Start Delimiter based on a string pattern

Figure 5.32: Defining a Contiguous Section based on a string pattern
Upon selecting the condition type, a Pattern Definition form is presented to the user (Figure 5.33), enabling the construction and testing of By Content definitions.

Figure 5.33: Pattern Definition form
Patterns can be defined in one of two ways:
o Simple mode: Directly specifies a string in the Pattern text field (Figure 5.34 – left);
o Advanced mode: Specifies a regular expression that captures a pattern (Figure 5.34 – right). The user may create the regular expression from scratch or may apply directly one of the regular expressions present in the Regular Expression Library. The regular expressions present in the library may also be used as a starting point for the definition of other regular expressions.



Figure 5.34: Applying a text pattern (left) or advanced regular expression (right) to an input file text
After testing the string pattern / regular expression, a section is created both in the FFD tree and in the file content pane (Figure 5.35).

Figure 5.35: Contiguous Section after applying a simple text pattern
A careful analysis shows that the contiguous section icon is different from the delimited section icon, and that the arrow icons for the contiguous section point directly to the text, representing a By Content definition. For contiguous sections the arrow icons shall always point to the file content, since contiguous sections can only be defined through By Content conditions.



5.2.2.2 Field Definition
The identification of fields (Single Value fields and Table fields) in the previously defined sections corresponds to the second step in the extraction of relevant data from input files. Both field types can be defined from the start of the section or given an offset of lines within the section.
Data quality mechanisms are available for both types of fields. To each single value or table column it is possible to associate a data type, a set of validation rules (e.g. minimum / maximum numeric value) and a missing value representation (common in scientific data files). Four data types are available (e.g. Figure 5.36): unbounded, numerical, textual and date. Each form presents a (Validation, Expression) pair table where the user specifies the values for each validation rule. If no value is defined for a validation rule, then this validation will not occur during the extraction process. Also present in each form is the Missing Values representation feature. This optional representation can be associated to a table column or single field. During the extraction phase, all field data that matches a missing value representation is replaced with a custom-defined value (e.g. the empty string ""). This customization is conducted globally at the input file level, and not at the Editor or field level, since it is expected that all the delivered data (at least within the same file) follow the same missing value representation when delivered to the IL phase.
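The data typing, validation and missing-value behaviour described above could be approximated as in the sketch below for a numeric column; the rule names mirror the wizard's validation table, but the class and its logic are only an assumption, not the actual implementation.

import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of numeric column validation with missing-value replacement:
// values matching the missing-value representation are replaced by a custom value,
// remaining values must parse as numbers and fall within the configured bounds.
class NumericColumnValidator {
    static List<String> validate(List<String> values, String missingRepresentation,
                                 String replacement, double min, double max) {
        return values.stream().map(v -> {
            if (v.equals(missingRepresentation)) {
                return replacement;                      // e.g. "" (empty string)
            }
            double d = Double.parseDouble(v);            // data type check (Numeric)
            if (d < min || d > max) {                    // Minimum Value / Maximum Value rules
                throw new IllegalArgumentException("Value out of range: " + v);
            }
            return v;
        }).collect(Collectors.toList());
    }
}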

Figure 5.36: Data types and validation rules (single value wizard)
Field definition is initiated by a default field creation gesture and completed via a specific wizard (since field definition is usually more complex than sectioning). Both single value and table fields are created by marking, in the file content pane, the line from which the section data will start to be considered for the field. If a section line other than the first is selected, then a popup menu is presented to the user; otherwise the wizard is presented automatically. The popup menu merely confirms whether the user wants to create a field considering the entire section text or from the selected section line forward.


5.2.2.2.1 Single Value

The first wizard step for creating a single field (Figure 5.37) consists of defining the type of single field from one of two possibilities: String Delimited or Regular Expression. Depending on this selection, dedicated tabs for each definition type are presented.

Figure 5.37: Single Value Wizard – Defining the single value type
Considering a String Delimited selection, the form displayed in Figure 5.38 is presented. This form enables the definition of prefix and / or suffix values that act as string boundaries in the definition. Each string segment has a specific colour indicating its role in the definition: Orange (left-side) prefix boundary, Green (centre-side) single field value and Purple (right-side) suffix boundary.

Figure 5.38: Single Value Wizard – Prefix based (left) or prefix and suffix based (right) single value
By dragging the line separator between each string segment, the user can specify the location and content of either the prefix or the suffix value. Both prefix and suffix values may be defined for capturing a single value, or only the prefix, or only the suffix. If only the prefix value is defined, then the single value shall be considered from the end of the prefix until the end of the line. Otherwise, if only the suffix value is defined, then the single value shall be considered from the start of the line until the suffix value is reached.
Instead, if the user selects the Regular Expression Single Value option, then a form for defining regular expressions is displayed (similar to the form for defining sections' regular expressions - Figure 5.34, on the right). If the defined regular expression captures multiple string values, only the first one is considered as the single field's value. Independently of the single value type, the final wizard step refers to the definition of data types, validation rules and missing value representation. Once the wizard is concluded, a new single field node is created in the FFD tree (under the section to which the field belongs) and in the file content pane (Figure 5.39). Selecting a single field changes the visualization of the file content pane, focusing on the selected element, and related metadata is also presented in the lower left panel.

Figure 5.39: Single value representation
5.2.2.2.2 Table

Table fields contain one or more table columns, which can be defined through fixed-width lengths, a regular expression or by specifying a column delimiter character that separates the columns (Figure 5.40).

Figure 5.40: Table Wizard – Defining the table type

Considering a Character Delimited selection, a form with the four most common column delimiter characters (i.e. Space, Tab, Semicolon and Comma) is presented. Further, this form also enables user-defined delimiter characters 26. Clicking on the Apply button, the user can visualize and validate the attained results in a tabular format (Figure 5.41).

Figure 5.41: Defining a character delimited table
However, if the user selects the Fix Width Table feature, then the form displayed in Figure 5.42 (left side) is presented. In this form, the user interacts with the text area containing the previously selected section text, creating fixed-width separators by clicking and dragging column separators.

Figure 5.42: Table Wizard – Defining a fix width table with 3 columns (left) and defining a regular expression table (right)

26 All delimiter characters are mutually exclusive, since only one character can be considered as the column delimiter.

Separators appear in a two-colour scheme (black and red), where the red colour indicates that the separator is currently selected (as opposed to black). As explained in the form, a click operation creates a new fixed-width separator, dragging a separator changes its position and a double-click operation on a selected separator removes it. Finally, if the user selects the Regular Expression option, then the form depicted in Figure 5.42 (right side) is displayed. Pressing the Apply button will display the outputs of the defined regular expression in a tabular format for user visualization and validation. Independently of the table type, the final wizard step refers to the definition of data types, validation rules and missing value representation (presented previously) for each of the table columns. Upon conclusion of the wizard, a new table field node is created in the FFD tree (under the section to which the field belongs) and in the file content pane. A set of column nodes is also created below the table node. Depending on whether the user selects a table node or a column node, the visualization in the file content pane changes, focusing on the selected element. Figure 5.43 provides an example of table selection and Figure 5.44 provides an example of column selection.

Figure 5.43: Selecting a Table node after creating a table field

Figure 5.44: Selecting a Column Table

5.2.2.3 Regular Expressions
Independently of the type of section and field specifications, both objects are represented internally as regular expressions. This internal representation is transparent to the end-user, who only requires knowledge of a set of gestures / wizards for interacting with the graphical application. The usage of regular expressions substantially increases the text processing performance due to their powerful supporting libraries, much faster than traditional string operations.
Figure 5.45 presents a mapping between simple By Content pattern 27 specifications defined in a sectioning wizard and their regular expression equivalents, automatically inferred by the application. These regular expressions refer to the Starting With (1), Containing (2) and Ending With (3) sectioning definitions.
1. (^EXPR.*\r\n)
2. (^.*EXPR.*\r\n)
3. (^.*EXPR\r\n)
Figure 5.45: Mapping By Content sectioning definitions to regular expressions
Single value representation is also mapped into regular expressions. Figure 5.46 presents the prefix / suffix definition for single values (1). Since a field definition can have an offset within the section text, this is also codified in the regular expression statement (2) – skipping the first two section lines in the example.
1. PREFIX(.*)SUFFIX
2. .*\r\n.*\r\nPREFIX(.*)SUFFIX
Figure 5.46: Mapping a single value definition with PREFIX and SUFFIX to a regular expression
Regarding Table fields, both types of non-regular expression definitions are also mapped into regular expressions. Figure 5.47 presents a mapping example of a table with four columns taking the blank character 28 as column separator.
(.*?)\s(.*?)\s(.*?)\s(.*?)\r\n
Figure 5.47: Mapping a character delimited table to a regular expression
Figure 5.48 provides a mapping for a fixed-width table with four columns (column lengths 4, 5, 6 and until the end of the line, respectively). An offset of five characters is present before defining the first table column.
.{5}(.{4})(.{5})(.{6})(.*)\r\n
Figure 5.48: Mapping a fixed width table to a regular expression
Regular expressions can also be defined based on a library of regular expressions that can be used as is or as a starting point for defining a regular expression that is not directly present in the library. This library is visible in the advanced definition of By Content sections, single values and tables, and is wizard accessible for all three objects.

27 Not a regular expression entered by the user.
28 Represented as \s in regular expression syntax.
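The sketch below applies two of the automatically generated expressions shown above with java.util.regex, extracting a prefix / suffix single value and a blank-delimited four-column table row. The sample input strings are illustrative assumptions only.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GeneratedRegexDemo {
    public static void main(String[] args) {
        // Single value delimited by PREFIX and SUFFIX (Figure 5.46, case 1).
        Matcher single = Pattern.compile("PREFIX(.*)SUFFIX").matcher("PREFIX42SUFFIX");
        if (single.find()) {
            System.out.println(single.group(1));              // 42
        }

        // Character (blank) delimited table with four columns (Figure 5.47).
        Matcher table = Pattern.compile("(.*?)\\s(.*?)\\s(.*?)\\s(.*?)\\r\\n")
                               .matcher("a b c d\r\n");
        if (table.find()) {
            for (int column = 1; column <= 4; column++) {
                System.out.println(table.group(column));      // a, b, c, d
            }
        }
    }
}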

Figure 5.50 displays the Regular Expression Library embedded in the wizard for the definition of a Contiguous section, as an example. The regular expression knowledge is present in a single XML file (example in the Annex Regular Expression Library (XML Instance)) conforming to the XML Schema depicted in Figure 5.49. This file can be edited and expanded with new regular expressions (when the FFD Editor is offline) and the changes will be made visible when the application is restarted.

Figure 5.49: Regular Expression Library concept
Regular Expressions are presented hierarchically, following a simple two-level tree. At the first level, a set of groupers (e.g. Date / Time) gather related regular expressions together under a common name. The leaf level contains the actual regular expression statements, a description (the tree node value) and a comment field that is displayed as a footnote when the regular expression tree node is highlighted (Figure 5.50).

Figure 5.50: Regular Expression Library (contiguous section wizard)

5.2.3 Transformation

Transformations are required since extracted data fields usually need to be cleansed or formatted in a different way before being delivered. Transformations can be seen as a sequential pipeline, where the outputs of a transformation constitute, most of the time, the inputs of subsequent transformations. This way, data undergoes a sequential transformation process. Transformations are represented graphically as an acyclic graph that easily identifies the role of each transformation within the transformation pipeline. Each start node of the pipeline refers to an extracted field or constant value, while the remaining nodes represent data transformation operations. Connections between transformation nodes represent an output from a source transformation that is being used as input by a target transformation. Two types of transformations are available: column / single field oriented and table oriented. In the first case the transformation affects only one table column / single value (e.g. appending a string to the end of each value of a selected column). In the second case the transformation affects multiple table columns (e.g. deleting rows from table columns given a matching criterion). The transformation area (Figure 5.51) follows a classical approach with multiple transformation pipelines that are represented as graphs. Each graph node represents a transformation that is part of an internal library (displayed as a toolbar). Having selected a specific transformation in a graph, it is possible to verify its correctness by performing a visual inspection of its data inputs and outputs in the Inputs / Outputs table area (below the graph area). Transformations require specific tuning metadata that must be defined by the user (e.g. for an AppendConstant transformation the user must define the constant to be appended and whether the constant will be placed as a prefix or suffix).
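As a rough illustration of the column / single field oriented transformation concept (and of the "two Java classes per transformation" extension effort reported in section 5.3), the sketch below shows what a minimal transformation contract and an AppendConstant implementation could look like; the interface and class shapes are assumptions for the example, not the actual DPM source code.

import java.util.ArrayList;
import java.util.List;

/** Minimal contract: a column-oriented transformation maps one list of values to another. */
interface ColumnTransformation {
    List<String> apply(List<String> column);
}

/** Appends (or prepends) a user-defined constant to every value of the selected column. */
class AppendConstant implements ColumnTransformation {
    private final String constant;
    private final boolean asPrefix;   // tuning metadata defined by the user

    AppendConstant(String constant, boolean asPrefix) {
        this.constant = constant;
        this.asPrefix = asPrefix;
    }

    public List<String> apply(List<String> column) {
        List<String> out = new ArrayList<String>();
        for (String value : column) {
            out.add(asPrefix ? constant + value : value + constant);
        }
        return out;
    }
}

A transformation pipeline is then just the sequential application of such nodes, each consuming the outputs of its predecessors.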

Figure 5.51: FFD Editor’s Transformation step



5.2.4 Data Delivery

The first step in defining a data delivery consists of selecting the parameters to which the data delivery refers. Depending on the parameters' nature, a template structure for the data delivery is defined in the form of (field name, data type) pairs. Associated with each pair, the user must drag-and-drop either references to extracted fields or transformation outputs. Figure 5.52 depicts the six areas that comprise the Data Delivery panel:

o Extract Tree (upper left): A tree with all the fields extracted in the Extract step;
o Visible Outputs Tree (middle left): A tree with all the transformation outputs made visible in the Transform step;
o Data Delivery Tree (lower left): A tree with all the created data deliveries;
o Toolbar (upper right): Creation, edition and saving operations for a data delivery;
o Template Area (middle right): A tabular template for data delivery definition. While the left column contains the column name for the data delivery field, the right column can receive a data reference drag-and-dropped by the user;
o Preview Area (lower right): Contains a tabular preview of the data to be delivered.

Figure 5.52: FFD Editor's Data Delivery step

Associated with the same input file, and depending on its specific data contents, several data deliveries may be specified. Depending on the data delivery selected in the Data Delivery Tree, its definition and preview data are updated accordingly in the Template Area and Preview Area, respectively. When defining a new data delivery the Identifier Picker form is displayed (Figure 5.53). The user must then specify which global identifiers are involved in the data delivery 29. Once the data identifiers have been selected, the Data Delivery Definition panel is initialized with an empty data delivery template, as depicted in Figure 5.54, presenting all the columns for the selected template (visualized as rows). Dragging the External Outputs nodes to the respective table row (under the External Output Name column) populates the template rows. After dropping the selected data reference in the template row, the corresponding column in the Delivered Data Table is populated with the data that the reference contains, acting as a data delivery preview. If the number of rows for the selected outputs is not the same, empty cells are added for the columns lacking data rows.
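The padding behaviour just described amounts to aligning columns of different lengths on the longest one; a minimal sketch of that idea is shown below (class and method names are illustrative only).

import java.util.ArrayList;
import java.util.List;

public class DeliveryPreviewSketch {

    /** Pads every column with empty cells until all columns have the same number of rows. */
    static List<List<String>> alignColumns(List<List<String>> columns) {
        int maxRows = 0;
        for (List<String> column : columns) {
            maxRows = Math.max(maxRows, column.size());
        }
        List<List<String>> aligned = new ArrayList<List<String>>();
        for (List<String> column : columns) {
            List<String> padded = new ArrayList<String>(column);
            while (padded.size() < maxRows) {
                padded.add("");   // empty cell for a column lacking data rows
            }
            aligned.add(padded);
        }
        return aligned;
    }
}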

Figure 5.53: Identifier picker form

Besides producing a data delivery for a single parameter, it is also possible to use an identifier-mapping feature. In that case the Identifier Mapping form is presented, where the user can select which input values shall be mapped into the selected EIDs. For defining this mapping the user must select a data field (i.e. a single value or table field column) present in the Extract Tree, or a transformation output from the External Outputs tree, which contains the source data to be mapped.

29 All selected identifiers must follow the same template (i.e. correspond to the same type of data).

Figure 5.54: An empty template (SW parameters) with a parameter identifier defined

5.2.5 Validation

As a final step, the user should 30 apply the FFD definition to a set of other input files in order to verify whether the definition is generic enough. For this purpose the user must first select a test set comprising input files from the same class, available in the File System. Once selected, the form displayed in Figure 5.55 is presented to the user. This form enables a direct visual comparison of Extract contents between the sample file and one of the test input files 31, applying the current FFD definition to both the sample and the selected test file. In case of a structural error while applying the FFD to the test file (i.e. when retrieving a section or field from the text, given its FFD definition), the error is identified and displayed to the user. The user can then correct the FFD (iteratively if required) and repeat the testing and validation procedure once the FFD definition has been corrected.

30 Although recommended, this step is optional.

31 Using the Prev and Next buttons the user can change the selected input file.


During the comparison, if the Synchronize section selection feature is selected, clicking on any section in the sample or input file identifies the corresponding section in the input or sample file, respectively, enabling a clearer traceability between the two files for the same sectioning definition. Further, if no error is found at the Extract level, the user can verify the correctness of the remaining Transformation and Data Delivery steps for the selected input file. In this case the contents of the Transformation and Data Delivery tabs are simply populated with the contents of the input file data, testing the correctness of the entire ETD pipeline. Once the validation process is complete, this visualisation is terminated and the sample file data is placed back in the Transformation and Data Delivery tabs. If no error is found in the definition, the FFD can be saved locally into the file system or uploaded directly to a Metadata Repository.

Figure 5.55: Comparing the sample file with the test set files

5.2.6 Recovery and Debug

Changes to a file format may cause file processing to fail, depending on the flexibility of the defined FFD. Failures can be detected at the structural level (e.g. unable to retrieve a section or field based on the FFD specification), at the data typing level (e.g. the data type of a single value, table column or transformation input does not conform to the FFD) or at the validation rule level (e.g. one or more rules do not apply to a section, single value or table column definition). Once a failure is detected, the file processing is discarded and the failure event is registered in the ETD Engine log. If the DPM HMI application is executing during the failure event, this logging information is also delivered to the application and displayed in the ETD engine graphical log. The FFD Editor is an active component in the recovery and debugging process. At program start, the user is asked whether the application should inspect the ETD logs produced since its last execution. If affirmative, the FFD Editor processes the log files (starting from the date the FFD Editor last processed the logging data) and, if a processing error is detected, the corresponding FFD is loaded into the application, together with the input file that raised the failure, and displayed in debug mode. In debug mode the user can inspect the faulty input file and compare it to the sample file used in the FFD creation. Once the FFD is corrected, a new version can be uploaded into the Metadata Repository and the DPM HMI application can be used to perform the processing of the file in an ad-hoc manner. Input files that have previously been downloaded but failed processing can then be reprocessed using the corrected FFD, which is made available via the Metadata Repository.

5.3 Language Expressiveness and Extensibility

In the scope of the SESS system (presented in detail in section 6.2) the expressiveness and extensibility of the ETD approach and FFD language have been put into practice in a real-world scenario. The metadata required for the definition of Sections and Fields was found to be simple and sufficient for all the 62 input files encountered in the context of the system. For a great part of them (the simplest ones) regular expression expertise was not required and other, simpler specifications were found to be more suitable (e.g. prefix / suffix field definition, table definition based on delimiter characters). Data typing and validation rule mechanisms provided added value to the FFD specification, enabling the detection of changes in the format of input files that were not caught at the structural level (i.e. during execution of the steps of extracting sections and fields). The usefulness of these mechanisms was verified three times (during the development of the system), when format changes were detected and noisy data was prevented from being forwarded to the IL layer. Starting with an initial set of 15 transformations, an extra set of 5 transformations was developed due to file format needs. Both the FFD language and the FFD Editor application proved to be quite easy to extend in this regard. Only a simple metadata file and two Java classes had to be created per new transformation, for the FFD Editor and ETD Engine to be able to recognize them and process data accordingly. No change to the FFD language was required for the new transformations. The data delivery metadata proved to be quite generic. In the specific case of SESS, the FFD language was defined prior to the structural definition of the data delivery templates that would exchange data between the ETD and IL layers. No modification to the FFD language was required once the structure for the templates was agreed.


Chapter 6 Case Studies

This chapter presents how the ETD+IL thesis has been put into practice in a set of case studies. The presentation follows two perspectives: first, the generality and versatility of the solution is explored for dealing with data from different domains. Second, it is explained how the Data Processing Module has been integrated, tested and validated in an operational space domain system: SESS. Special attention is placed on this second approach, starting with an overview of the SESS system objectives and the Galileo reference mission. Then, the overall SESS architecture is described, including a summarized explanation of all the components that have not been developed in the context of this thesis. The final section describes how the ETD+IL approach has been successfully applied to SESS and provides an evaluation of its usage.


The validation of the thesis formulated in Chapter 3 is described in this chapter and has been performed using two approaches. First, the generality and versatility of the solution was tested using heterogeneous data from different domains, namely: stock trading, banking, geological, physical and spatial. This validation has been performed by configuring the data processing solution for downloading data files from each domain and developing FFDs 32 for each input file. As a result, a set of small individual prototypes has been developed. Second, the data processing solution has been applied to a real operational system, not a prototype. For this approach the context, requirements and solution for the SESS space domain system are described, focusing on how the ETD+IL solution has been applied to the system. An analysis of the use of ETD+IL is presented, based on a set of metrics derived from the actual system functioning parameters and user feedback.

6.1 Versatility for Multiple Domains

The ETD+IL approach has been applied to five domains, ranging from stock trading to spatial data, in order to determine its versatility. Due to the amount of textual data available on the Internet this was not a difficult task, as there are many public data service providers that could be used in the tests described below. Two criteria were followed for choosing the test data sets / domains: the absence of a direct relation between them (independent domains) and the existence of dynamic data sets, updated frequently, for which an automatic data processing solution would provide added value.

6.1.1 Stock Trading Domain

The stock-trading domain provides a good example of a near-real-time system, providing textual information regarding stock market trends to the community. Public information is limited to a small set of index values that do not include the actual stock exchange quotations. In order to acquire full data, a license has to be purchased from the data service provider (e.g. [77]). One example of such files (available freely as sample data on the data service provider web page) is depicted in Figure 6.1 and has been used to test the ETD+IL approach. The explanation of the text file structure is not available in the file itself but at the data service provider web site [77].

32 The FFD creation did not follow any specific domain expertise (unknown to the author) for data transformation.

ABN|ABN Amro Holding N.V. ADS|000937102|||||0.009878||37705190000|1635800000|23.05|0|23.05|23.05|23.05|1635800000|1|1|1|1|1|1|1|NYSE|US0009371024|20040310|NL|USD
AEG|Aegon N.V. Ord Shares|007924103|||||0.005017||19150809543.7776|1514377800|14.32|0|14.32|14.32|14.32|1514377800|1|0.8831|1|1|1|1|1|NYSE|US0079241032|20040310|NL|USD
AIB|Allied Irish Banks PLC ADS|019228402|||||0.003357||12816008574|845941160|30.3|0|30.3|30.3|30.3|845941160|1|1|0.5|1|1|1|1|NYSE|US0192284026|20040310|IE|USD
AL|Alcan Inc.|013716105|||||0.004335||16548947877.28|366695056|45.13|0|45.13|45.13|45.13|366695056|1|1|1|1|1|1|1|NYSE|CA0137161059|20040310|CA|USD
ALA|Alcatel S.A. ADS|013904305|||||0.005121||19548332381.05955|1284301318|17.02|0|17.02|17.02|17.02|1284301318|1|0.8943|1|1|1|1|1|NYSE|US0139043055|20040310|FR|USD

Figure 6.1: Part of text file example containing stock information [77]

6.1.2 Banking Domain

Within the banking domain, textual data is limited either to tax rates or currency exchange rates. Many national bank sites hold this information, such as the Banco de Portugal Internet site [3]. Since the exchange rates are the most dynamic data (updated daily), an example has been selected from this area (Figure 6.2) that references exchange rates, in euros, since 1999-1-4 for all the daily rated currencies.

1999-1-4;1.91;-;1.8004;1.6168;0.58231;35.107;7.4501;15.6466;0.7111;327.15;-;251.48;-;133.73;-;-;-;;8.855;2.2229;4.0712;-;9.4696;-;189.045;-;-;1.1789;6.9358;1.4238;110.265;9.4067;244.383;1.19679;-;-;-;-;-;;-;
1999-1-5;1.8944;-;1.7965;1.6123;0.5823;34.917;7.4495;15.6466;0.7122;324.7;-;250.8;-;130.96;-;-;-;;8.7745;2.2011;4.0245;-;9.4025;-;188.775;-;-;1.179;6.7975;1.4242;110.265;9.4077;242.809;1.20125;-;-;-;-;-;;-;
1999-1-6;1.882;-;1.7711;1.6116;0.582;34.85;7.4452;15.6466;0.7076;324.72;-;250.67;-;131.42;-;-;-;;8.7335;2.189;4.0065;-;9.305;-;188.7;-;-;1.1743;6.7307;1.4204;110.265;9.3712;244.258;1.20388;-;-;-;-;-;-;-;
1999-1-7;1.8474;-;1.7602;1.6165;0.58187;34.886;7.4431;15.6466;0.70585;324.4;-;250.09;-;129.43;-;-;-;;8.6295;2.1531;4.0165;-;9.18;-;188.8;-;-;1.1632;6.8283;1.4074;110.265;9.2831;247.089;1.21273;-;-;-;-;-;-;-;
1999-1-8;1.8406;-;1.7643;1.6138;0.58187;34.938;7.4433;15.6466;0.7094;324;-;250.15;-;130.09;-;-;-;;8.59;2.1557;4.0363;-;9.165;-;188.84;-;-;1.1659;6.7855;1.4107;110.265;9.3043;249.293;1.20736;-;-;-;-;-;-;-;
1999-1-11;1.8134;-;1.7463;1.6104;0.58167;35.173;7.4433;15.6466;0.7044;323.4;-;249.7;-;126.33;-;-;-;;8.5585;2.1257;4.032;-;9.0985;-;188.9655;-;-;1.1569;6.791;1.4005;110.265;9.2329;251.013;1.22081;-;-;-;-;-;;-;

Figure 6.2: Part of text file example containing exchange rates [3]

The explanation about the text file structure can be found in [3].
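Purely as an illustration of how the character-delimited table mapping of Figure 5.47 carries over to a file like this one, the sketch below builds the equivalent regular expression for a ';' separator and captures the first four columns; the column count, class name and sample line are chosen arbitrarily for the example.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimitedTableSketch {
    public static void main(String[] args) {
        // Same shape as Figure 5.47, but with ';' instead of the blank character,
        // capturing the date and the next three columns only.
        Pattern row = Pattern.compile("(.*?);(.*?);(.*?);(.*?);.*\r\n");

        String sample = "1999-1-4;1.91;-;1.8004;1.6168;0.58231\r\n";
        Matcher m = row.matcher(sample);
        if (m.matches()) {
            System.out.println("Date: " + m.group(1));                          // 1999-1-4
            System.out.println("Columns: " + m.group(2) + ", " + m.group(3) + ", " + m.group(4));
        }
    }
}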


6.1.3 Geological Domain

Within the geological domain there are multiple data sets of public text-based information, ranging from static information regarding the soil properties of a given country / region to near-real-time data for geology-related events. This last data set is the most appealing for applying the ETD+IL solution due to its high refresh rate, plainly justifying an automatic data processing solution. Data from two near-real-time geological alarm systems have been selected for this purpose, enabling the monitoring of earthquake [78] and volcano [79] occurrences.

Src,Eqid,Version,Datetime,Lat,Lon,Magnitude,Depth,NST
ci,10220897,1,"December 03, 2006 14:23:10 GMT",33.3498,-116.3186,1.4,16.40,38
ci,10220893,1,"December 03, 2006 13:44:22 GMT",36.5321,-117.5756,1.7,2.30,25
hv,00021774,0,"December 03, 2006 13:18:52 GMT",20.0145,-156.0878,2.5,4.90,00
ci,10220889,1,"December 03, 2006 13:10:10 GMT",33.3656,-116.3988,1.4,12.50,17
hv,00021773,0,"December 03, 2006 12:41:26 GMT",20.0178,-156.0533,2.8,10.30,00
nc,51176492,1,"December 03, 2006 12:30:26 GMT",37.9132,-122.1070,1.3,9.20,20
hv,00021772,0,"December 03, 2006 12:22:19 GMT",20.0608,-156.0787,4.4,8.00,00
ci,10220885,1,"December 03, 2006 11:26:06 GMT",36.1035,-117.6626,1.4,3.30,13
ak,00073026,5,"December 03, 2006 11:25:06 GMT",63.3522,-149.2984,2.1,90.00,10
ci,10220877,1,"December 03, 2006 10:44:49 GMT",33.6775,-116.7146,1.5,23.60,22
ci,10220873,1,"December 03, 2006 10:33:08 GMT",33.0630,-115.9320,1.1,7.70,25
ci,10220869,1,"December 03, 2006 10:21:37 GMT",33.6730,-116.7583,1.8,17.50,76
nc,51176491,1,"December 03, 2006 10:19:17 GMT",36.4552,-121.0322,1.7,6.30,16
us,vvaq,6,"December 03, 2006 09:31:41 GMT",39.4334,143.1486,4.8,30.60,62
ci,10220861,1,"December 03, 2006 08:20:49 GMT",34.3721,-117.7458,1.9,0.00,13
ci,10220857,1,"December 03, 2006 08:20:47 GMT",33.5010,-116.5713,1.5,9.80,23
us,vvam,7,"December 03, 2006 08:19:50 GMT",-0.6310,-19.7690,5.1,10.00,32

Figure 6.3: Part of text file containing earthquake occurrences data [78]

The two text files depicted in Figure 6.3 and Figure 6.4, regarding daily earthquake occurrences and volcano daily alerts, respectively, have been selected, analysed and then successfully processed with the ETD+IL solution.

Year Day Terra Aqua Total
2000 055 232 0 232
2000 056 203 0 203
2000 057 288 0 288
2000 058 277 0 277
2000 059 259 0 259
2000 060 288 0 288
2000 061 267 0 267
2000 062 277 0 277
2000 063 288 0 288
2000 064 288 0 288
2000 065 288 0 288
2000 066 287 0 287
2000 067 273 0 273
2000 068 279 0 279
2000 069 256 0 256

Figure 6.4: Part of text file containing volcano daily alerts [79]


6.1.4 Physical and Spatial Domains

Physical and space domain data has also been used to determine the versatility of the ETD+IL solution, following two approaches. An individual testing prototype has been configured for dealing with Potentially Hazardous Asteroids (PHA) data [80]. Information regarding PHAs is updated on a daily basis or as soon as a new relevant PHA object is found (Figure 6.7). A broader test of ETD+IL has been performed in the context of a wide space domain operational system – SESS – which is detailed in the following section. Data received as input by SESS refers to physical measures of solar activity (e.g. Figure 6.5, Figure 6.6), telemetry from the spacecraft (e.g. house-keeping telemetry 33) and spacecraft orbital data.

:Product: 20050427events.txt
:Created: 2005 Apr 28 0302 UT
:Date: 2005 04 27
# Prepared by the U.S. Dept. of Commerce, NOAA, Space Environment Center.
# Please send comments and suggestions to [email protected]
#
# Missing data: ////
# Updated every 30 minutes.
# Edited Events for 2005 Apr 27
#
#Event  Begin  Max  End  Obs  Q  Type  Loc/Frq  Particulars  Reg#
#-------------------------------------------------------------------------------
[event rows omitted]

Figure 6.5: Part of a text file example of solar activity events [2]

:Product: 0427GEOA.txt
:Issued: 2005 Apr 27 0335 UTC
# Prepared by the U.S. Dept. of Commerce, NOAA,
# Space Environment Center.
# Geoalert WWA117
UGEOA 20401 50427 0330/ 9935/ 11271 20271 30271 99999
UGEOE 20401 50427 0330/ 26/00 99999
UGEOI 20401 50427 0330/ 26/// 10020 20910 30030 40000 50000 61207 71404 80001 90550 99999
UGEOR 20401 50427 0330/ 26/24 27101 10756 20000 30400 44535 50550 60010 25506 16200 99999
PLAIN

Figure 6.6: A text file example of flare, magnetic and proton forecasts [2]

33 Spacecraft telemetry for accessing the status of the spacecraft’s internal systems.

List Of The Potentially Hazardous Asteroids (PHAs)

Information on converting absolute magnitudes to diameters is available, as is an explanation of the quantities given in the listings above. A list of close approaches to the earth through the end of the 21st century is available. NOTE: The quantity EMoid in the table below does not give any information on actual close approaches to the earth--you should consult the previously-referenced list for such details. See a plot of the innermost solar system (or the inner solar system). This list is updated daily, except for the few days surrounding the preparation of each batch of MPCs. It is also updated as and when new objects are discovered.
--------------------------------------------------------------------------------
Columns: Designation (and name); Prov. Des.; Discovery date, site and discoverer(s); EMoid; q; Q; H; Epoch; M; Peri.; Node; Incl.; e; a; Opps.; Ref.
[asteroid rows omitted]

Figure 6.7: Part of a text file example of potentially hazardous asteroids [80]


6.2 Space Environment Support System for Telecom / Navigation Missions

The Space Environment Support System for Telecom / Navigation Missions (SESS) [7] is a multi-mission decision support system, capable of providing near real-time monitoring and visualization [8], in addition to offline historical analysis [81] of S/W and S/C data, events and alarms to Flight Control Teams. This system is based on the Space Environment Information System for Mission Control Purposes (SEIS) [9] experience and developed prototypes. Similarly to SEIS, the main goal of the system is to provide S/C and S/W data integration to Flight Control Teams. The system has been developed for the European Space Agency (ESA) by a consortium formed by Deimos Space [82] as prime contractor, UNINOVA [38] and INTA [83]. The author participated in this project in the context of the CA3 research group [84], which belongs to UNINOVA. In this project, the author had the responsibility of developing an ETL solution that enables the automatic download and processing of textual data from a heterogeneous set of data sources. Further, a set of management and control applications were also developed in this context.

6.2.1 Galileo Mission

Although a generic system, the SESS scope has been defined to support navigation and telecommunication missions with regard to space weather conditions. In the context of SESS, two reference missions were considered: the Galileo In-Orbit Validation Element A (GIOVE-A) and GIOVE-B prototype satellites, which belong to the novel Galileo satellite cluster (Figure 6.8 and Figure 6.9).

Figure 6.8: Galileo cluster (artist's impression)


Nowadays, European users of satellite navigation are limited to the United States Global Positioning System (GPS) or the Russian Global Navigation Satellite System (GLONASS) satellites. Since both satellite classes are property of the military, no guarantee against service disruption is provided with regard to their public civil use. Satellite positioning is becoming a standard way of navigating on the high seas and, in the near future, it will certainly spread to both land and air vehicles. When such dependency is reached, the implications of a signal failure will be serious, jeopardising not only the efficient running of transport systems, but also human safety. Galileo shall be Europe’s first global navigation satellite system, providing a highly accurate and available global positioning service under civilian control, with positioning accuracy down to the metre range. Further, it will be capable of interfacing with the existing GPS and GLONASS navigation systems. It will guarantee service availability, informing users within seconds of a failure on any satellite. This will make it suitable for applications where safety is crucial, such as running trains, guiding cars and landing aircraft. The first experimental satellite, called GIOVE-A (Figure 6.9 – left side), was launched at the end of 2005. Due to the success of GIOVE-A in establishing a link to Earth within the assigned radio frequency, the launch of GIOVE-B (Figure 6.9 – right side) has been continuously postponed.

Figure 6.9: Galileo spacecraft prototypes in orbit (artist's impression): GIOVE-A (left side) and GIOVE-B (right side)

The fully deployed Galileo system consists of 30 satellites (27 operational + 3 active spares), positioned in three circular MEO planes at 23 222 km altitude above the Earth, and at an inclination of the orbital planes of 56 degrees with reference to the equatorial plane. The large number of satellites, together with the optimisation of the constellation and the availability of the three active spare satellites, will ensure that the loss of one satellite will be transparent to the remaining ones.



6.2.2 Objectives

Since SESS is an evolution of SEIS, all SEIS objectives have been inherited by this project, namely:

o Reliable S/W and S/C data integration;
o Inclusion of S/W and S/W effects estimations generated by a widely accepted collection of physical S/W models;
o Near real-time alarm triggered events, based on rules provided by the Flight Control Teams;
o Near real-time visualization of ongoing S/W and S/C conditions through the Monitoring Tool;
o Historical data visualization and correlation analysis using OLAP technology – the Reporting and Analysis Tool.

Further, since SEIS has been developed as a prototype while SESS intends to be an operational system, the core SEIS system components – DPM, Metadata Repository and DIM – had to be redesigned and re-implemented in order to offer an open-source, scalable and high-performance solution. Functionalities and applications that were found to be required (in the scope of SEIS) have also been placed as requirements for the SESS solution (e.g. the development of a graphical editor for creating FFDs). Also, SEIS is a mono-mission system for a single satellite (Integral was used as the reference mission) while SESS intends to be a multi-mission system, e.g. for all Galileo satellites as well as other missions. Thus security and isolation measures had to be implemented in SESS (which were missing in SEIS), since spacecraft telemetry is usually proprietary and confidential among missions.

6.2.3 General Architecture

The SESS system is composed of a Common infrastructure and possibly multiple Mission infrastructures (Figure 6.10).

Figure 6.10: SESS Common and Mission infrastructure interaction (Mission perspective)

The Common infrastructure contains public domain data: S/W measures taken by public observatories or universities and S/C data made public to the scientific community (e.g. S/C onboard radiation monitor measures). A Mission infrastructure holds proprietary data: either internal S/C telemetry or confidential S/W measures taken by the internal sensors of the spacecraft. All infrastructures implement a subscription / publication scheme based on Really Simple Syndication (RSS) [85]. This communication scheme enables any Mission infrastructure to publish common parameters in a Common infrastructure in order to make them accessible via subscription to other Missions. Independently of being a Common or a Mission infrastructure, each infrastructure follows the same generic architecture depicted in Figure 6.11.

Figure 6.11: Generic SESS infrastructure

The Data Processing Module is responsible for the download and ETD processing of each provided file made available by the different public and / or private data service providers. Communication with the data service providers is performed via well-known protocols like HTTP and FTP or through a Web Service interface. After being processed, data is delivered to the Data Integration Module, which comprises two databases: an Operational Data Storage for the parameter values within the last five days (sliding window) and a Data Warehouse for storing historical data. Data stored in the Operational Data Storage is visible through the Monitoring Tool [8], while historical data can be analysed through the Reporting and Analysis Tool [81].

6.2.3.1 Metadata Repository

Due to the complexity of the SESS system, multiple definitions are spread throughout its components. Such definitions – e.g. domain concepts such as S/W and S/C parameters, user settings and application configurations – are common and shared by multiple system components: metadata. These are gathered together in a single component – the Metadata Repository [74, 75] – in order to maintain information coherence and consistency throughout all system components. SESS is a metadata-oriented system. On start-up, every system component connects to the Metadata Repository in order to acquire its latest configuration, and on exit it may write the configuration back. Then, depending on each application's needs and purposes, other metadata operations (e.g. queries) may be performed on the Metadata Repository.


Since metadata can be used for representing a wide variety of different concepts, the Metadata Repository solution must be generic enough to accept any type of metadata, independently of its internal structure. Since XML technology is a W3C standard that enables simple representation of information, provides validation capabilities and is supported by a wide set of efficient tools and libraries, it has been selected for storing, managing and manipulating metadata. Thus, all metadata instances are stored as XML documents [70], while all concepts (which define the instances' structure) are represented through XML Schemas [86].
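For illustration only, the standard javax.xml.validation API shown below is one way such instance-against-schema checks can be performed in Java; the file names are placeholders, not the actual repository layout.

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class MetadataValidationSketch {
    public static void main(String[] args) throws Exception {
        // Load the concept definition (XML Schema) and validate a metadata instance against it.
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("concept.xsd"));        // placeholder file name
        Validator validator = schema.newValidator();
        validator.validate(new StreamSource(new File("instance.xml")));   // throws an exception on invalid content
        System.out.println("Metadata instance conforms to its XML Schema.");
    }
}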

6.2.3.2 Multi Mission Module (3M)

The Multi Mission Module, an internal data service provider in the scope of SESS, provides data estimation and forecasting of both space weather and orbital data. In both cases, data previously stored in the Data Integration Module is used as input by 3M. Depending on the age of the input data, the resulting forecast can be classified according to data quality measures. Space weather estimations are generated by executing a set of physical and mathematical models that are well known and commonly accepted within the space community (e.g. the SPENVIS model set [36]). Estimations can vary according to solar activity or the spacecraft's dynamic orbiting position. A set of thirteen S/W models has been implemented in the scope of 3M [9].

6.2.3.3 Data Integration Module

The Data Integration Module is SESS’s supporting infrastructure, responsible for storing all data – either real-time or historical – and making it available to SESS services and client applications. The database design was strongly influenced by the SEIS DIM solution [87], being reimplemented in SESS using a different technology: SQL Server was replaced by Oracle Database 10g [88]. Not much information regarding this module is available, since it was not made public by the company responsible for developing the module (DEIMOS Space). However, and considering that the SEIS DIM scheme has been followed, the SESS DIM should probably be composed of three databases:

o Operational Data Store: A simple transactional database oriented towards real-time activities. It is the source of all Monitoring Tool data;
o Data Warehouse: A historical database that provides direct or indirect OLAP data support. It is the source of all Reporting and Analysis Tool data;
o Data Mart: A Multidimensional On-Line Analytical Processing (MOLAP) database that supports the Reporting and Analysis Tool, containing pre-computed data aggregations for improving querying speed.

6.2.4 ETD+IL Integration and Usage in SESS

The ETD+IL approach and the Data Processing Module architecture and software have been devised and implemented during the execution of the SESS system. At the time, both the architecture and the software were designed and implemented in such a way that the outcome would not be hardcoded to SESS, but rather be a generic, domain-independent solution. Such a solution comprised one year of work: two months for requirement definition and architectural design, six months for development and four months for corrections, testing and validation. The ETD+IL approach fits the SESS needs completely as a data processing solution, since all input data is received from text files containing semi-structured text data. In the scope of SESS, a set of 62 provided files has been selected (as containing relevant data) from a total of 8 different Data Service Providers (Table 6.1): Space Environment Technologies [89], Solar and Heliospheric Observatory (SOHO) Proton Monitor [90], Sunspot Index Data Centre (SIDC) [91], Lomnicky Peak [76], NOAA [2], World Data Center for Geomagnetism [92], European Space Operations Centre (ESOC) and the Multi Mission Module (3M). For each provided file, a specific FFD has been created with the FFD Editor tool.

Table 6.1: SESS Data Service Providers and associated Provided Files

Data Service Provider                              Data   Protocol      Files   Identifiers
Space Weather Technologies                         S/W    HTTP              1             2
SOHO Proton Monitor Data                           S/W    HTTP              2             6
Solar Influences Data Analysis Center (SIDC)       S/W    FTP               2             5
Lomnicky Peak’s Neutron Monitor                    S/W    HTTP              1             2
NOAA - Space Environment Centre (NOAA – SEC)       S/W    FTP              52           542
World Data Centre for Geomagnetism                 S/W    HTTP              2             1
European Space Operations Centre (ESOC)            S/C    HTTP              1           113
Multi Mission Module (3M)                          S/W    Web Service       1            78
Total                                                                      62           748

Following this perspective, the Data Processing Module only needed to be duly configured with SESS-specific metadata (e.g. Data Service Providers, Provided Files and File Format Definitions) in order to execute. Only the Data Delivery web service required the development of specific code for dealing with the SESS-specific database solution (which is outside the ETD solution scope). Since the SESS input files are almost the same as the ones used in SEIS, it is possible to compare the effort spent with and without the use of the FFD Editor graphical tool for the generation of FFDs. In SEIS, without any kind of graphical support, the FFD creation tasks required four months of work, while in SESS this effort was downsized to 1½ months, an effort reduction of 62%. The user interactivity proposed in both the FFD Editor and the DPM HMI proved to be adequate in practice, with minor corrections motivated by user feedback. Regarding the creation of FFDs, some commonly executed behaviour patterns were detected and the application was modified in order to provide a more automatic and user-friendly response. One such improvement was performed for the creation of date values. The previous scheme was supported by a pipeline of append transformations that proved to be time consuming and prone to error 34. This pipeline was replaced by a dedicated transformation for creating a date, where the user selects the six date / time elements and the separators for the date and time parts. Figure 6.12 to Figure 6.15 present some statistics regarding the FFDs created in the scope of SESS, namely the frequency of sections, fields, transformations and identifiers comprised in the data deliveries 35. Analysis of the section (Figure 6.12) and field (Figure 6.13) frequencies shows that, for the files taken as input by the SESS system, the number of sections is usually reduced (around 3 sections per file) as well as the number of fields (on average below 7 fields per file).

Figure 6.12: Section frequency per input file (1 section: 5 files; 2 sections: 4; 3 sections: 45; 4 sections: 2; more than 4: 7)

Figure 6.13: Field frequency per input file (1-2 fields: 28 files; 3-4 fields: 1; 5-6 fields: 23; 7-8 fields: 1; more than 8: 10)

34 “Year Field” appended to “-”, appended to “Month Field”, appended to “-”, appended to “Day Field”, appended to “ ”, appended to “Hour Field”, appended to “:”, appended to “Minute Field”, appended to “:”, appended to “Second Field”.

35 The Annex SESS File Format Definition Statistics presents the detail of the sections, fields, transformations and identifiers defined for each of the 62 FFDs created for SESS.

The same conclusion does not apply to the transformation usage (Figure 6.14), where the average number of operations reached the 31-40 range, denoting a moderate discrepancy between the way data is formatted when received as input and the data format required for output. Finally, regarding the data in the input files that can be referenced to parameter identifiers, it is possible to see that files with a reduced number of identifiers (within the 1 to 5 identifiers range) and files with a medium number of identifiers (within the 6 to 15 identifiers range) are the two most common distributions.

Figure 6.14: Transformation frequency per input file (1-10 transformations: 2 files; 11-20: 4; 21-30: 4; 31-40: 48; more than 40: 5)

Figure 6.15: Frequency of identifiers per input file (1-5 identifiers: 29 files; 6-10: 9; 11-15: 19; 16-20: 2; more than 20: 4)

A quick inspection of the total values for the created sections, fields, transformations, data deliveries and identifiers involved highlights in a clear way the importance of the FFD Editor application. Otherwise, the XML syntax for all these statements would have to be defined by a manual process, using an XML editor. The ETD supporting language proved to be expressive enough for dealing with all the encountered provided files, in terms of sectioning, extraction and data delivery. Since transformations are highly dependent on the file structure, the decoupling of the transformation representation from the ETD language proved to be a good approach, especially since a plugin approach was followed. From the initial set of 15 transformations, 5 extra transformations were developed during the FFD creation task, corresponding to specific necessities of a small sub-set of provided files. Using the transformation plugin architecture, the previously developed source code was maintained in isolation from the new source code, avoiding changes to operational transformations and the inclusion of new bugs. During nominal operational execution, the DPM downloads (concurrently) around 4000 files per day, representing an average storage load of 140 MB of text per day (an average of 35 KB per text file). The performance of the ETD engine based on the FFD declarative language presented very good results, especially considering that a near real-time requirement was imposed by ESA stating that the entire download / processing / data delivery task should not exceed 5 minutes. In SESS, processing times range from a 1 second average for small files to a 30 second average for large telemetry files (3 MB). The overall average per file processed is around 2 seconds. The data-typing scheme present in ETD proved to be useful during the SESS maintenance phase, when the format change of three files was detected and reported to the system administrator. This way, invalid values were not delivered for insertion into the system databases. Data traceability was applied in the final stages of system validation for tuning the created FFDs. During this stage invalid values were found in the SESS database, resulting from the lack of data typing at the FFD level for some provided files. Performing data traceability from the data delivery that contained the invalid data, it was possible to determine the FFD that was incomplete, as well as the input text files that were the source of the invalid data propagation.


Chapter 7 Evaluation and Conclusions

The final chapter summarizes the work described in this report, presenting an overview and evaluation of the ETD+IL conceptual approach. Overall conclusions are presented and future work in the field is proposed, pointing to an evolution of the solution herein presented.


The final chapter summarizes the work described in this report, focusing on the evaluation of the proposed conceptual approach and its supporting prototype application. The initial conceptual approach and the requirements previously established in Chapter 3 are revisited and each is discussed regarding its degree of accomplishment. Finally, some conclusions are presented as well as future work, mostly regarding application evolution.

7.1 Conceptual Approach Evaluation

In order to provide a clear separation of concerns, this report presented a different approach to ETL, separating domain ETD operations that require domain expertise from technical IL operations that require computer-science expertise, such that ETL = ETD + IL. This approach intends to ease the processing of semi-structured data, available in text files, using a specialized tool suite which can be effectively handled by non-expert users. By including a domain-related Data Delivery phase, the domain expert can define which processed data shall be delivered to the target system model, as output of the Extraction and Transformation phases. During Data Delivery the domain expert uses an abstraction that keeps the delivery completely independent of the target application that will use the processed data, as well as of the internal structure in which processed data is stored. The new Integration and Loading phases require only computer-science expertise. The Integration step can be decomposed into three main tasks. First, different data deliveries may be gathered together. Then, a unified view is produced where operations such as the removal of duplicates or the creation of artificial keys may be performed. Finally, the unified view may undergo a format change depending on the specific procedure used for data loading or on the target data store requirements / design. The Loading phase simply consists of the invocation of a loading program given a pre-formatted data set (the output from the Integration phase). The conceptual approach has been put into practice and taken as the foundation for the requirement, analysis, design, development and validation phases in the development of the DPM solution. By differentiating domain from computer-science operations, the development time required for an ETL solution is reduced and the overall data quality is improved by a close validation of domain data by a domain expert (instead of a computer-science expert). It was possible to determine the validity and usefulness of separating domain from technical expertise in the scope of the SESS system, where a drastic reduction in the FFD production time took place (compared with a similar effort performed for the SEIS system, which did not follow this approach). Further, FFDs have not only been created faster but also contain extra domain expertise (e.g. data quality), making them more valuable and complete. The validation of the conceptual approach has been performed in two ways: (i) The generality and versatility of the solution was tested using heterogeneous data from different domains. As a result, a set of small individual prototypes has been developed. (ii) The data processing solution has been applied to a real operational system (SESS), not a prototype. The approach was suitable for SESS, since all input data is received from text files containing semi-structured text data. In the scope of this system a set of 62 provided files has been selected as containing relevant data from a total of 8 different Data Service Providers. For each provided file a specific FFD has been created with the FFD Editor tool by an external domain expert (not the author of the DPM solution). Following this perspective, the Data Processing Module only needed to be duly configured with SESS-specific metadata in order to execute. Only the Data Delivery web service required the development of specific code for dealing with the SESS-specific database solution (which is outside the ETD solution scope). The user interactivity proposed in both the FFD Editor and the DPM HMI proved to be adequate in practice, with minor corrections motivated by user feedback. The ETD supporting language also proved to be expressive enough for dealing with all the encountered provided files, in terms of sectioning, extraction and data delivery. Since transformations are highly dependent on the nature / format of the data present in the input files, the mechanism for adding new transformations to the FFD language (and Editor application) using a plugin philosophy proved to be a good approach, easing the introduction of new transformations.

7.2 Requirements for ETD

In order to implement a suitable DPM solution according to the ETD + IL conceptual approach, a set of thirteen requirements has been proposed in Chapter 3. An individual evaluation of each requirement, according to its degree of accomplishment, follows.

7.2.1 Free, Open Source and Independent

The solution shall be implemented using open-source technologies, presented as a no-acquisition-cost solution accessible to anyone. Furthermore, the solution shall be developed using software independent of the operating system. (Accomplished)

In the development of the data processing solution only free, open-source and OS-independent software has been used. Three technologies have mainly contributed to attaining this requirement: the Java programming language, XML technologies / libraries (following W3C open standards) and the public Tomcat Web Server for managing application communication through web services. External software components (e.g. graph visualization libraries) with lesser relevance in the overall development context have also followed these three guidelines. This way, the entire data processing solution, independently of the application, component or package, is open to future extensions developed by third parties.

7.2.2 Completeness

A complete data processing solution shall be available comprising data retrieval, data processing and overall management of the data processing solution. (Accomplished)

Although the FFD declarative language and the FFD graphical editor are the cornerstone technologies of the ETD + IL approach, three other applications have been developed in order to attain a complete data processing solution: (i) the File Retriever web service, responsible for all data retrieval tasks, (ii) the ETD web service, responsible for all data processing, and (iii) the DPM HMI application, which provides overall management of the data processing solution. Together, these three components comprise a complete data processing solution.

7.2.3 Separation of Concerns

The domain user shall be able to use and maintain the data processing pipeline without requiring computer-science expertise. All domain procedures and definitions shall be represented using a high-level declarative language. No specific source code shall be required to implement the processing of a single text file. (Accomplished)

The separation of domain knowledge from technical computer-science knowledge has been a constant throughout the thesis execution. No computer-science expertise (e.g. programming effort, database or XML schemas, querying languages) is required for representing the ETD domain knowledge. All technical tasks are confined to the IL layer. Programming tasks have been replaced by ETD procedures and definitions, represented in the FFD language, which are then made transparent to the user via the FFD Editor graphical application. Further, the domain user can manage the entire data processing pipeline through the DPM HMI application. With the DPM HMI the metadata creation / edition process is simplified (i.e. XML and communication details are abstracted) and the user can control and monitor the data processing execution and status, respectively.

7.2.4 User Friendliness

A graphical application shall be available, making use of the declarative language in a way that is transparent to the end user. (Accomplished)

To accomplish this requirement, the FFD Editor application has been developed. Special care has been taken with the human-machine interaction, where a set of gestures, graphical notations (e.g. icons, colours) and wizards has been implemented. Instead of explicitly creating an FFD instance using the FFD language constructs, the user performs annotations over a sample file and the FFD definitions are inferred from them, enabling a greater independence between application interaction and metadata representation. Although with lesser relevance, the same guidelines have been applied to the DPM HMI graphical application, enabling an intuitive monitoring and control of the data retrieval and processing pipeline.

7.2.5 Performance

Data retrieval and data processing shall have a reduced response time while preserving both CPU and network bandwidth resources. (Accomplished)


The developed DPM solution achieved good performance in the data retrieval and data processing tasks, measured by the time required for retrieving and processing a file, respectively. Considering both cases, the data processing task is the one most dependent on good performance. In this respect, the processing performance has been tuned in order for the DPM solution to work in two common extreme scenarios: frequent requests for processing small data files and infrequent requests for processing large data files (i.e. 3 Megabytes in length).

7.2.6 Scalability

Both data retrieval and data processing must be capable of handling multiple simultaneous downloads and processing requests, respectively. (Accomplished)

The DPM architecture has been devised to be scalable when the number of files to retrieve and / or the data load to process increases. The architectural setup of the DPM solution is very dynamic, and multiple FR and ETD instances may be placed on different Internet connections and machines to increase the data retrieval and data processing throughput, respectively. Special attention has been placed on the scalability of the data processing task, since it represents the most common case. Three load balancing schemes have been created in order to ease the distribution of data processing over different ETD instances executing on different machines: (i) Specific routing, where files are processed by specific ETD engines; (ii) Restricted Round-Robin, where files are processed by one processor from a pool of ETD engines (a subset of all the available ETD instances), cyclically; (iii) Unrestricted Round-Robin, where files are processed by one processor from a pool with all the available ETD engines, cyclically.
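The two round-robin schemes boil down to cycling over a (possibly restricted) pool of engine endpoints; the sketch below illustrates that dispatch logic only, with a made-up EtdEngine type standing in for the actual ETD web service client.

import java.util.List;

/** Illustrative stand-in for a remote ETD engine endpoint. */
interface EtdEngine {
    void process(String providedFile);
}

/** Cycles over a pool of engines; a restricted pool is simply a subset of all available instances. */
class RoundRobinDispatcher {
    private final List<EtdEngine> pool;
    private int next = 0;

    RoundRobinDispatcher(List<EtdEngine> pool) {
        this.pool = pool;
    }

    synchronized void dispatch(String providedFile) {
        EtdEngine engine = pool.get(next);
        next = (next + 1) % pool.size();   // cyclic selection of the next engine
        engine.process(providedFile);
    }
}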

7.2.7 Modularity

The solution architecture and implementation shall be as modular as possible, clearly separating the ETD pipeline from the IL pipeline. Further, there shall be a clear separation between logic and presentation layers, easing future maintenance tasks. (Accomplished)

The solution has been developed to be as modular as possible, with different web services for each of the retrieval (FR web service), ETD (ETD web service) and IL (Data Delivery Interface web service) tasks, which may be deployed on different machines. These three components represent the logical layer presented in the conceptual approach, which is kept in isolation from the graphical layer: DPM HMI, FFD Editor and Log Analyser.

7.2.8 Reusability System modules shall be designed and implemented focusing on reutilization as much as possible. Such approach shall be applied for factoring common behaviour / functionalities within the data processing solution itself or for reusing entirely / partially system components in the solution of other problems. (Accomplished)

- 153-

The data processing components have been implemented according to the reusability requirement, developed to be as generic as possible and abstracting domain-specific characteristics. This concern has been mainly visible in the implementation of four main components / libraries:
o The component responsible for FFD interpretation and file processing has been developed as a separate library that is used both by the FFD Editor application at design time and by the ETD engine at execution time (a minimal sketch of such a shared entry point is given after this list);
o A logging component, covering both the low-level read / write logic and the graphical interaction with filtering and querying capabilities, has been developed and is used by the FR web service (logical part), ETD web service (logical part), DPM HMI (graphical part) and Log Analyser (graphical part);
o An XML library has been created, presenting an extra layer (closer to the developer) on top of Java’s XML facilities, which are too low-level. Since all data retrieval and processing metadata is codified in XML format, the usage of this library is crosscutting to the entire DPM solution;
o In a similar way, a library for regular expressions has been developed that is used both at FFD design time and during the actual data processing.
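The reuse of the FFD interpretation / file processing logic by the FFD Editor (design time) and by the ETD engine (execution time) can be pictured as a single shared entry point. The Java interface below is a minimal sketch of that idea only; all type and method names are hypothetical and do not correspond to the actual library API.

    import java.io.File;
    import java.util.List;

    // Illustrative only: one processing entry point shared by the FFD Editor
    // (design-time preview and generality testing) and by the ETD engine (online
    // processing). All type and method names are hypothetical.
    public interface FfdProcessor {

        // Applies a File Format Definition (an FFD XML document) to an input text
        // file and returns the records produced by the ETD pipeline, together with
        // any data-quality warnings raised by the validation rules.
        Result process(File ffdDocument, File inputFile) throws Exception;

        interface Result {
            List<String> deliveredRecords();
            List<String> qualityWarnings();
        }
    }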

Further, since the entire data processing solution is generic in itself, not containing domain-specific knowledge hardcoded into its components, with the correct metadata configuration the DPM solution can be reused as a COTS package for data processing.

7.2.9 Metadata Driven
The data processing solution shall be metadata driven, which means that all processes for executing and managing the data retrieval and ETD pipeline are based on metadata. (Accomplished)
The DPM is a metadata-oriented solution, where metadata is used to instantiate the generic data processing pipeline according to the needs of each domain. Both the data retrieval and the data processing tasks are executed according to the defined metadata. Although omnipresent in all system components, metadata is made as transparent as possible to the end user through the use of graphical applications. A Metadata Repository is proposed for storing and supporting all metadata operations. However, the DPM does not depend on this module and can use metadata as stand-alone files, without recourse to any application other than the file system.
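The Java sketch below illustrates the two metadata sources mentioned above, the Metadata Repository and stand-alone files on the file system, both yielding the same XML metadata document. The class and method names are hypothetical; only the XML parsing calls are standard Java APIs.

    import java.io.File;
    import java.net.URL;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    // Illustrative sketch of the two metadata sources: the Metadata Repository
    // (reached over the network) or stand-alone XML files on the file system.
    // The class and method names are hypothetical; only the XML parsing calls
    // are standard Java APIs.
    public class MetadataLoader {

        private final DocumentBuilder builder;

        public MetadataLoader() throws Exception {
            builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        }

        // Metadata published by the Metadata Repository (e.g. an FFD exposed at a URL).
        public Document fromRepository(URL metadataUrl) throws Exception {
            return builder.parse(metadataUrl.openStream());
        }

        // The same metadata used stand-alone, read directly from the file system.
        public Document fromFileSystem(File metadataFile) throws Exception {
            return builder.parse(metadataFile);
        }
    }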

7.2.10 Correctness
Data typing facilities and validation rules shall be available during the entire ETD process, in order for the outcome of ETD to be valid. These data quality mechanisms shall be applied iteratively in the Extraction, Transformation and Data Delivery steps. (Accomplished)
Data correctness has been implemented in the DPM prototype by including data types and validation rules in all three phases of the ETD process.
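As a minimal illustration of the data typing and validation mechanism, the Java sketch below shows a numerical field with a range validation rule that rejects invalid values before they can be delivered; the class is hypothetical and does not correspond to the actual FFD implementation.

    // Illustrative only: a numerical field with a range validation rule, applied to
    // the values produced in the Extraction, Transformation and Data Delivery steps.
    // The class is hypothetical, not the actual FFD data-typing implementation.
    public class NumericalField {

        private final String name;
        private final double min;
        private final double max;

        public NumericalField(String name, double min, double max) {
            this.name = name;
            this.min = min;
            this.max = max;
        }

        // Returns the parsed value, or throws if the raw value violates the data type
        // or the validation rule, so that invalid data is never delivered to the IL layer.
        public double validate(String rawValue) {
            final double value;
            try {
                value = Double.parseDouble(rawValue.trim());
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException(name + ": not a numerical value: " + rawValue);
            }
            if (value < min || value > max) {
                throw new IllegalArgumentException(name + ": " + value + " outside [" + min + ", " + max + "]");
            }
            return value;
        }
    }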


During SESS system execution, errors were detected on three occasions due to the use of data typing and validation rules, contributing to the detection of file format changes without propagating invalid values into the IL layer.

7.2.11 Validation
After performing the ETD specification based on a primary input file, the FFD generality shall be tested with a larger set of text files belonging to the same class of files. (Accomplished)
Since the FFD specification is based on a single sample input file, it may not be generic enough to process all the data files within that class. For this reason, when a FFD is concluded and prior to the saving / deployment operation, the FFD Editor application suggests that the domain user test the FFD generality using multiple input files. If errors are detected, they are displayed so that the user can modify the initial specification before making it available for online data processing.
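The generality test can be pictured as the batch application of the FFD to a set of sample files, collecting every failure for the user to review. The Java sketch below is an illustration only; all names are hypothetical.

    import java.io.File;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative sketch of the generality test suggested by the FFD Editor: the same
    // FFD is applied to a set of sample files of the same class and every failure is
    // reported back to the user. All names are hypothetical.
    public class FfdGeneralityTest {

        // Whatever component knows how to apply an FFD to an input file.
        public interface Processor {
            void process(File ffdDocument, File inputFile) throws Exception;
        }

        private final Processor processor;

        public FfdGeneralityTest(Processor processor) {
            this.processor = processor;
        }

        // Returns a map from each failing sample file to the error raised;
        // an empty map means the FFD processed every sample successfully.
        public Map<File, String> run(File ffdDocument, File[] sampleFiles) {
            Map<File, String> failures = new LinkedHashMap<File, String>();
            for (File sample : sampleFiles) {
                try {
                    processor.process(ffdDocument, sample);
                } catch (Exception e) {
                    failures.put(sample, e.getMessage());
                }
            }
            return failures;
        }
    }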

7.2.12 Data Traceability
It shall be possible to trace back a processed datum value to the originally downloaded file. (Accomplished)
A data traceability scheme has been implemented in the DPM solution that comprises the entire data retrieval, ETD and IL pipeline. At data retrieval (FR engine), a timestamp, the original file location and the file location within the FR Cache are stored for each download operation. At the ETD level, both the file name and the FFD used in the data processing are kept, together with timestamps for all major processing steps. Further, all data deliveries contain a unique serial number that is stored internally in the ETD engine log and is propagated to the IL layer. This way, traceability can be further expanded beyond the IL layer if sufficient metadata is kept linking each datum to the data delivery serial number. With this metadata it is possible to trace back any processed datum value received at the IL layer to the original downloaded file present in the FR Cache.
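The traceability metadata described above can be pictured as a small record kept per data delivery, as in the Java sketch below; the field names and the use of a UUID as serial number are assumptions for illustration, not the actual ETD log schema.

    import java.util.UUID;

    // Illustrative only: the kind of traceability metadata described above, linking a
    // data delivery back to the downloaded file and to the FFD that processed it.
    // The field names and the use of a UUID as serial number are assumptions,
    // not the actual ETD log schema.
    public class DeliveryTrace {

        private final String serialNumber;     // unique identifier propagated to the IL layer
        private final String sourceUrl;        // original file location at the data provider
        private final String cachedFile;       // file location within the FR Cache
        private final String ffdName;          // FFD used in the data processing
        private final long downloadTimestamp;  // when the file was retrieved
        private final long processedTimestamp; // when ETD processing finished

        public DeliveryTrace(String sourceUrl, String cachedFile, String ffdName,
                             long downloadTimestamp, long processedTimestamp) {
            this.serialNumber = UUID.randomUUID().toString();
            this.sourceUrl = sourceUrl;
            this.cachedFile = cachedFile;
            this.ffdName = ffdName;
            this.downloadTimestamp = downloadTimestamp;
            this.processedTimestamp = processedTimestamp;
        }

        public String getSerialNumber() {
            return serialNumber;
        }

        // A datum received at IL carrying this serial number can be traced back to the
        // cached file and, through sourceUrl, to the original download.
        @Override
        public String toString() {
            return serialNumber + " <- " + ffdName + " <- " + cachedFile + " <- " + sourceUrl
                    + " (downloaded " + downloadTimestamp + ", processed " + processedTimestamp + ")";
        }
    }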

7.2.13 Fault Tolerance
In case of failure during download, the recovery of the failed file shall be retried. If an error occurs during data processing, the administrator must be notified and other data processing operations shall be resumed. (Accomplished)
Special emphasis has been placed on the data retrieval process in case of error during retrieval, attaining a balanced trade-off between data recovery and network saturation. Thus, depending on the type of data (i.e. real-time or summary), the number and frequency of retrials is customized accordingly. In case of error when applying a FFD to an input file, the system administrator is notified by email, with the input file that raised the error attached and a reference to the FFD that failed processing.
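The retrial policy can be pictured as a per-data-type configuration of the number of attempts and the delay between them, as in the Java sketch below; the figures and names are placeholders, not the values configured in the operational system.

    // Illustrative sketch of the retrial policy: the number of attempts and the delay
    // between them depend on the kind of data being retrieved, trading recovery speed
    // against network saturation. The figures below are placeholders, not the values
    // configured in the operational system.
    public enum RetryPolicy {

        REALTIME(3, 60 * 1000L),        // few, closely spaced retries for near real-time data
        SUMMARY(5, 30 * 60 * 1000L);    // more, widely spaced retries for daily summary files

        private final int maxAttempts;
        private final long delayMillis;

        RetryPolicy(int maxAttempts, long delayMillis) {
            this.maxAttempts = maxAttempts;
            this.delayMillis = delayMillis;
        }

        // Runs the download attempt until it succeeds or the policy is exhausted.
        public void download(Runnable downloadAttempt) throws InterruptedException {
            for (int attempt = 1; ; attempt++) {
                try {
                    downloadAttempt.run();
                    return;
                } catch (RuntimeException error) {
                    if (attempt >= maxAttempts) {
                        throw error;  // give up: the failure is then logged and notified
                    }
                    Thread.sleep(delayMillis);
                }
            }
        }
    }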



7.3 Conclusions
This section provides some final remarks that summarise the work performed in the scope of the thesis and described in the current report.
First, a study of the current state of the art has been conducted, focused on research, open-source and commercial ETL tools and work currently underway. Following these findings, a set of requirements, difficulties and opportunities has been identified when dealing with semi-structured textual data. According to these, a conceptual approach has been suggested that proposes the separation of ETL into domain-related and technical-related parts. In order to prove the feasibility and correctness of the proposed approach, a set of high-level requirements has been defined and followed during the development of the application software. Two software components were identified as being closely related to the ETD+IL approach: a language that enables the representation of ETD statements and a graphical application that hides the language's technical details from the domain user. However, in order to have a generic data processing solution that can be handled as an independent COTS, software components for data retrieval and for monitoring and controlling the processing pipeline have also been developed.
The software development occurred in the context of the space domain SESS system (not a prototype), which was also taken as a case study for the validation of the conceptual approach and software. The results attained in the context of SESS were quite positive, both in terms of the proposed separation of concerns and of the software produced as a result of the operationalisation of the thesis. The generality of the conceptual approach was also validated with data from different domains, through the creation of a set of simple prototypes.

7.4 Future Work
Although the feasibility and applicability of the conceptual approach have already been proved in practice, further enhancements can still be introduced, mainly at the level of the graphical applications, through the inclusion of new functionality as well as the improvement of the existing user-machine interaction. Some ideas for possible application enhancements follow:
1. The File Retriever application enables the download of web pages by explicitly defining the URL where the page resides. However, for processing multiple web pages accessible via a common index web page, multiple Provided File metadata instances have to be created. The development of a web crawler that would iterate over all the pages accessible via direct links and try to process them using a FFD would be a more user-friendly way of dealing with multiple related web pages;
2. The file processing could be optimized for very large text files (above the 3.5 MB threshold). Currently, during processing, the entire textual content is placed in memory, making the solution non-scalable for large volumes of textual data. Further, the parallel execution of independent transformation pipelines should be explored, in order to also increase the processing performance;
3. The Extract panel in the FFD Editor application should be redefined as a plugin and a common interface should be agreed. This way it would be possible to introduce specialized Extract panels, depending on the type of semi-structured textual data; XML and HTML files are two possible examples. For an XML source, XQuery and XSLT statements could be automatically inferred from the user's interaction with the XML file. For an HTML source, the rendered HTML content could be presented to the user, making the HTML syntax transparent; according to the user's interaction with the HTML page, parsing statements would be automatically inferred that take the HTML syntax into consideration;
4. The extension of the four data types currently available for defining data quality (i.e. unbounded, numerical, textual and date) would be an interesting feature. The creation of custom data types (e.g. enumerations) should follow a plugin approach with a logical and a graphical Java class (similar to the transformation plugin scheme). The graphical class would consist of a panel that could be inserted in a generic data type frame within the FFD Editor application;
5. Currently, the user must explicitly define sections, single value fields and table fields. The FFD Editor application should propose (even in a simple way) a sectioning of the file and a set of fields. In a similar way, some data quality features could be automatically inferred, such as the data type associated with a single value or table column;
6. It would be useful to have data validation rules that relate different data fields and / or transformation outputs. A rule editor, global to the entire FFD, could provide a way to compare fields, using arithmetic and logic operators. If a data validation rule is triggered, a message would be presented to the user when using the FFD Editor, or reported in the application log when in operational mode;
7. In the current version of the FFD Editor, the domain user must specify each transformation operation one by one. Due to the complexity of some files, a considerable number of data transformations may be required, which becomes a long and tedious process for the user. This process could be automated to some extent by enabling the application to propose a possible transformation pipeline. This proposal would be based on the output results (including the field value and data quality metrics such as data type and validation rules) provided by the user, the available input set of extracted fields / transformation outputs, and the transformation library definitions;
8. In the Transformation tab all the available transformations are presented in a single toolbar. This scheme is only feasible if the number of transformations is limited (as is currently the case). Instead, transformations should be selected from a graphical control containing a tree where transformations would be grouped by type / functionality (similar to the scheme used for representing regular expressions). In this case the toolbar would be used just for storing the most common transformations;
9. Although sufficient in the scope of SESS, the scheme of keeping all the transformations at the same flat level is not scalable for files that require a long set of transformations in the same pipeline. Both the FFD language and the FFD Editor should enable a customizable multi-level approach for organizing transformations in a simpler and more intuitive way;
10. Data delivery templates can only be created outside the FFD Editor application, as metadata instances in the Metadata Repository. It should be possible for the domain user to define a data delivery template graphically inside the FFD Editor application and then deploy it to the Metadata Repository;
11. It could be an interesting feature to introduce user annotations (i.e. comments) and associate them with ETD elements (e.g. sections, fields, transformations);
12. The practical impact of including a filtering cache for dealing with repeated input data prior to the execution of the ETD process should be evaluated (a minimal sketch of such a cache is given after this list). A duplicated data unit can be a database record, if the data comes from a database, or a text line, if the provider is a text file. With the current solution, all data filtering must be performed either by using a binary data service provider, which by itself guarantees no data repetition, or by processing all data (even repeated data) without distinction and then performing the data filtering at the IL layer, resulting in unnecessary processing.
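Regarding item 12, the filtering cache can be pictured as a set of already processed data units consulted before the ETD process runs, as in the Java sketch below; the class is hypothetical and, in a real setting, the cache would have to be bounded and persisted across runs.

    import java.util.HashSet;
    import java.util.Set;

    // Illustrative only: a possible filtering cache for item 12, dropping data units
    // (e.g. text lines) that have already been processed before the ETD process runs.
    // In a real setting the cache would have to be bounded and persisted across runs.
    public class FilteringCache {

        private final Set<String> seen = new HashSet<String>();

        // Returns true if the data unit is new and should be processed,
        // false if it duplicates an already processed unit.
        public boolean accept(String dataUnit) {
            return seen.add(dataUnit);
        }
    }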



Annexes

SESS Data Processing Module Requirements

Table 8.1: DPM global requirements
Description | Priority
SESS shall connect to the mission generic infrastructure as an external entity. | Must
SESS will process TM already converted to engineering values (i.e., TM previously unpacked and calibrated by the Ground Program Control - GPC). | Should
For satellites providing TM in near real-time, SESS shall perform real time TM processing. | Should
SESS will be able to connect to the GPC from any location having internet capabilities. | Must
SESS will take the TM data from the GIOVE-B mission web server via FTP. | Should
For the GIOVE-B mission, SESS will be located at the facilities proposed by ESA. | Must
The SESS will be able (after acceptance by the GPC Administrator of its User Access Rights) to retrieve from the GPC Web Site Catalogue the daily files related to the Standard Radiation Monitor (SREM) Calibrated TM. | Must
SE information shall be obtained directly from external Space Effects (SE) centres. | Must
SE and Effects Database shall allow uploading SE data from a broad range of external data providers. | Must
SE and Effect Database shall include Navigation and Telecom SE anomalies data. | Must

Table 8.2: ETD Engine requirements
Description | Priority
The ETD engine shall be implemented using Java (J2SE 5.0). | Must
The ETD engine shall be accessed as a web-service. | Must
The ETD engine shall process all space-weather data in near real-time. | Must
The ETD engine shall process space-weather data with the highest priority. | Should
The ETD engine shall process SREM Calibrated TM with medium priority. | Should
The ETD engine shall process historical data with low priority. | Should
The ETD engine shall provide enough information for data tracing purposes. | Should
Whenever the ETD engine detects an error due to a change in a provided file format, the system administrator shall be notified by email, with the input file and corresponding FFD attached. | Would
The ETD engine shall send to DIM all processed data, using a Web Service for communication. | Must
ETD metadata shall be retrieved from the Metadata Repository, using a Web Service for communication. | Should
The ETD engine shall optimise the processing of large and very large files. | Must
The ETD engine shall run on Microsoft Windows 2000 / XP. | Must

Table 8.3: FFD Editor requirements
Description | Priority
The FFDE shall be implemented using Java (J2SE 5.0). | Must
The FFDE shall be a desktop tool, i.e. not running through a web interface. | Must
The FFDE shall have two modes of interaction with the end user: free definition (without automatic support) and a wizard based definition. | Should
The FFDE shall create File Format Definitions (FFDs) through user annotations on a sample file, also provided by the user. | Should
The FFDE shall enable the specification of a set of validation rules for data quality purposes for all parameters. | Must
The FFDE shall enable to perform data type validation for all parameters. | Must
The FFDE shall be able to process a set of sample files for determining the correctness of the File Format Definition. | Should
FFDE metadata shall be retrieved from the Metadata Repository, using a Web Service for communication. | Should
The FFDE shall run on Microsoft Windows 2000 / XP. | Must

Table 8.4: FR requirements
Description | Priority
The FR shall be implemented using Java (J2SE 5.0). | Must
The FR shall be a server side application. | Must
The FR shall visualize the log information, enabling user operations like Search and Filter by. | Would
The FR shall provide visualization for all active threads and their current status. | Would
The FR shall provide a specialized editor for managing Data Service Provider and Provided File instances. | Should
The FR shall be able to retrieve data files through a HTTP connection. | Must
The FR shall be able to retrieve data files through a FTP connection. | Must
The FR shall be able to retrieve data files from the 3M forecast application. | Must
The FR shall store in a compressed folder all the files retrieved from external sources. | Should
The FR shall log information regarding the scheduling, download and dispatch to process for every file. | Should
The FR shall be able to retrieve data from S/W data providers: NOAA/SEC, NOAA/NGDC, Kyoto World Data Center, SIDC, Soho Proton Monitor, Lomnicky’s Peak Neutron Monitor and Space Weather Technologies (providers used in the SEIS project). | Would
The FR shall retrieve on a daily basis SREM Calibrated TM from the GPC Web Site Catalogue (FTP Server). | Would
The FR shall enable the loading of historical files into the SESS system. | Should
The FR shall enable to Start / Stop individual file acquisition without restarting the application. | Should
The FR shall enable to Start / Stop all file acquisition from a Data Service Provider without restarting the application. | Should
The FR shall provide enough information for data tracing purposes. | Should
The FR shall send to the ETD engine all downloaded files for processing, using a Web Service for communication. | Must
FR metadata shall be retrieved from the Metadata Repository, using a Web Service for communication. | Should
The FR shall run on Microsoft Windows 2000 / XP. | Must


Available Transformation Operations

Table 8.5: Available transformation operations
Name | Description
AddDate | Adds a constant value to a date.
AppendConstant | Appends to an existing value a constant string.
CreateConstant | Creates a new constant value.
CreateConstantDate | Creates a new constant date value.
CreateDate | Creates a new date value.
DateConvert | Converts an input date value to a different output date format.
DeleteDateRows | Deletes all rows matching a given date criteria.
DeleteInvalidDateRows | Deletes all rows having an invalid date value.
DeleteStringRows | Deletes all rows matching a given string criteria.
Distribute | Distributes the string content of a column over a second one.
Duplicate | Duplicates a field as many times as the size of the input field.
GetElement | Returns an element given its index.
Join | Joins the content of two input values into a third output value.
Map | Maps a value to a new one.
Merge | Merges two columns into one.
OrderDateRows | Orders the content of two columns - by row - given a date criteria.
PadChar | Pads - left or right - a given string with a given character until a specified length is attained.
RemoveChar | Removes all instances of a specified character.
Replace | Replaces the values of an input field matching a given criteria.
Split | Splits a value in two parts, given a separator scheme.
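As an illustration of how one of the operations above could be realised as a logical (non-graphical) transformation class, the Java sketch below implements the PadChar behaviour described in Table 8.5; the class name and constructor are hypothetical and do not correspond to the actual transformation plugin API.

    // Illustrative only: the PadChar operation of Table 8.5 written as a logical
    // (non-graphical) transformation class. The class name and constructor are
    // hypothetical and do not correspond to the actual transformation plugin API.
    public class PadCharTransformation {

        public enum Side { LEFT, RIGHT }

        private final char padChar;
        private final int targetLength;
        private final Side side;

        public PadCharTransformation(char padChar, int targetLength, Side side) {
            this.padChar = padChar;
            this.targetLength = targetLength;
            this.side = side;
        }

        // Pads the input value with padChar on the configured side until targetLength
        // is reached; values already long enough are returned unchanged.
        public String apply(String value) {
            StringBuilder result = new StringBuilder(value);
            while (result.length() < targetLength) {
                if (side == Side.LEFT) {
                    result.insert(0, padChar);
                } else {
                    result.append(padChar);
                }
            }
            return result.toString();
        }
    }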


Regular Expression Library (XML Instance)
Date / Time


SESS File Format Definition Statistics

Table 8.6: SESS File Format Definition statistics. For each FFD created for the SESS system, the table lists the number of Sections, Fields, Transforms, Data Deliveries and Identifiers. The FFDs covered are: 27do, 3daypre, 45df, Ace_epam (Realtime / Summary), Ace_mag (Realtime / Summary), Ace_pkp, Ace_sis (Realtime / Summary), Ace_swepam (Realtime / Summary), AK (Realtime / Summary), Boumag (Realtime / Summary), Crn, dpd, dsd, dssn, events, G10pchan (Realtime / Summary), G10xr (Realtime / Summary), G11pchan (Realtime / Summary), G12pchan (Realtime / Summary), G12xr (Realtime / Summary), Geoa, Goes 10 Magnetic Components (Realtime / Summary), Goes 12 Magnetic Components (Realtime / Summary), Iono_Beijing, Iono_Boulder, Iono_Hobart, Iono_Jicamarca, Iono_Juliusruh, Iono_Ksalmon, Iono_Kwajalein, Iono_Learmonth, Iono_LouisVale, Iono_Magadan, Iono_Sanvito, Iono_Wallops, Kyoto_dst, Lomnicky Peak, Meanfld, Mssn, Pmsw, S2k, Sgas, Solar Radio Flux Data, Srs, Sunmaxmin, Thule neutron (Realtime / Summary), 3M SW Integral Propagator and TM Integral, followed by an overall Total row.
