o SFTP: The input data is transmitted through the SFTP transfer protocol. This transmission is orchestrated by Control-M, a batch scheduling tool that serves the entire data lake provisioning life cycle (a minimal transfer sketch is given after this section).

o HDFS: This component comprises three provisioning layers:

Staging: Temporary storage layer where the files reside as they arrive from their origin; for pilot purposes they were standardized as CSV files. Files should remain in this layer for the minimum time possible, both because the information can contain sensitive data (in the following layers there is the possibility of encryption) and because flat files take up a lot of space that should be freed for formats consumable by Big Data users.

Raw: The raw data layer of the Big Data environment. It contains files compressed at the row level (Avro format), intended for subsequent transformations rather than for queries and advanced analytics. Thanks to its disk savings this layer also serves as historical storage, and as a backup against possible reprocessing of the next layer.

Master: The final layer of the Data Lake, compressed at the columnar level (Parquet format) with the aim of supporting exploratory analysis and the processing of analytical models. This is the layer that Data Scientists access from their respective sandboxes (a layer-to-layer conversion sketch is given after this section).

Figure 2: Comparison of file formats. Source: (Plase et al., 2017)

As can be seen in the previous figure, Avro and Parquet stand out among the other file formats for their integration of data structures and compression support; these characteristics are consistent with the manipulation of data in a Big Data environment.

o Nexus Repository: The artifact repository that stores all the configuration files used to feed the different layers of the Data Lake, as well as the libraries consumed in complex processing (a retrieval sketch is given after this section). Currently this component is being replaced by JFrog Artifactory, an open-source tool with the same functionality as Nexus that integrates more efficiently with the installed ecosystem.

o Automation API: The Spark-based processing and clustering engine. It is distributed across a series of containers (Docker) and agents (nodes) that allocate cores efficiently to execute the different jobs that provision files into the Data Lake. As can be seen in Figure 1, this layer stores the jobs (used for information processing, such as feeding the Data Lake), which are created and executed through REST requests (communication between servers via HTTP); a submission sketch is given after this section.
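The following is a minimal sketch of the SFTP transfer step described above, written in Python with the paramiko library. The host name, credentials, and file paths are hypothetical; in the environment described here this step is scheduled by Control-M rather than run as an ad-hoc script.

    import paramiko

    # Hypothetical connection details; in practice these would come from
    # the Control-M job definition and a credential store.
    transport = paramiko.Transport(("sftp.example-bank.com", 22))
    transport.connect(username="ingest_user", password="***")
    sftp = paramiko.SFTPClient.from_transport(transport)

    # Pull the daily extract into the landing directory of the edge node,
    # from where it is loaded into the HDFS Staging layer.
    sftp.get("/outbound/accounts_20220101.csv",
             "/landing/accounts_20220101.csv")

    sftp.close()
    transport.close()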
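The Staging-to-Raw-to-Master flow can be illustrated with a PySpark sketch: CSV files from Staging are rewritten as row-compressed Avro in Raw, and the Raw history is then published as columnar Parquet in Master. The HDFS paths and dataset name are assumptions; reading and writing Avro requires the external spark-avro package.

    from pyspark.sql import SparkSession

    # The spark-avro module must be on the classpath, e.g. launched with
    # --packages org.apache.spark:spark-avro_2.12:3.3.0
    spark = SparkSession.builder.appName("datalake-provisioning").getOrCreate()

    # Staging -> Raw: read the temporary CSV drop and persist it as
    # row-level-compressed Avro for history and backup.
    staging = spark.read.option("header", "true").csv("/data/staging/accounts/")
    staging.write.format("avro").mode("overwrite").save("/data/raw/accounts/")

    # Raw -> Master: re-read the Avro history and publish a columnar
    # Parquet copy for exploratory analysis in the sandboxes.
    raw = spark.read.format("avro").load("/data/raw/accounts/")
    raw.write.mode("overwrite").parquet("/data/master/accounts/")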
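Configuration files and libraries are fetched from the artifact repository over HTTP. A sketch of such a retrieval with the Python requests library follows; the repository URL, artifact coordinates, and credentials are hypothetical.

    import requests

    # Hypothetical artifact location inside a Nexus (or Artifactory)
    # repository that holds the ingestion configuration files.
    url = ("https://nexus.example-bank.com/repository/datalake-configs/"
           "ingestion/accounts/1.0.0/accounts-ingestion-1.0.0.json")

    resp = requests.get(url, auth=("reader", "***"), timeout=30)
    resp.raise_for_status()

    # Persist the configuration locally for the provisioning job to use.
    with open("accounts-ingestion.json", "wb") as fh:
        fh.write(resp.content)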
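Finally, a sketch of how a provisioning job might be created and executed through REST requests. The text states only that jobs are created and run via HTTP between servers, so the endpoint, payload fields, and two-step create/run protocol below are assumptions for illustration.

    import requests

    # Hypothetical Automation API endpoint and job payload.
    api = "https://automation-api.example-bank.com/v1/jobs"
    job = {
        "name": "accounts-raw-to-master",
        "engine": "spark",
        "conf": {"executor.cores": 4, "executor.instances": 8},
    }

    # Create the job definition...
    created = requests.post(api, json=job, timeout=30)
    created.raise_for_status()
    job_id = created.json()["id"]

    # ...and trigger its execution on the Docker/agent cluster.
    run = requests.post(f"{api}/{job_id}/run", timeout=30)
    run.raise_for_status()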