A data file is any file that contains information rather than code. It is intended only to be read or displayed. This web page, a letter written in a word processor, and a plain text file are all data files. This article discusses the data file formats used in big data.
Types of Data File Formats
You can use the following four different file formats.
- Text files
- Sequence Files
- Avro data files
- Parquet file format
Let us get more detailed information about each file format.
Text Files
The text file is the simplest and most human-readable format. It can be read or written in any programming language, and its records are usually delimited by commas or line breaks.
The text file format uses more space when a numeric value has to be stored as a string. It is also difficult to represent binary data, such as an image, in a text file.
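As a rough illustration of the space cost, the sketch below stores the same integer once as text and once as a fixed-width binary value. The sample value and the 4-byte integer layout are assumptions chosen for the example.

```python
import struct

value = 1234567890

# As text, every decimal digit takes one byte.
as_text = str(value).encode("ascii")

# As binary, a 32-bit signed integer always takes four bytes.
as_binary = struct.pack("<i", value)

print(len(as_text))    # 10
print(len(as_binary))  # 4
```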
Sequence Files
The sequence file format can be used to store binary data such as an image. Sequence files store key-value pairs in a binary container format and are more efficient than text files. However, sequence files are not human-readable.
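The idea of a binary key-value container can be sketched in a few lines. This is not the actual Hadoop SequenceFile layout. The length-prefixed record format and the helper names `write_record` and `read_records` are assumptions made for illustration.

```python
import io
import struct


def write_record(stream, key: bytes, value: bytes) -> None:
    """Write one record: two 4-byte lengths, then the key and value bytes."""
    stream.write(struct.pack(">II", len(key), len(value)))
    stream.write(key)
    stream.write(value)


def read_records(stream):
    """Yield (key, value) pairs until the stream is exhausted."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return
        key_len, value_len = struct.unpack(">II", header)
        key = stream.read(key_len)
        value = stream.read(value_len)
        yield key, value


buffer = io.BytesIO()
# Values are arbitrary bytes, for example the first bytes of an image file.
write_record(buffer, b"logo.png", bytes([0x89, 0x50, 0x4E, 0x47]))
write_record(buffer, b"icon.gif", b"GIF89a")

buffer.seek(0)
records = list(read_records(buffer))
print(records)
```

Because the values are raw bytes, the container holds binary content without any escaping, but the resulting file cannot be read in a text editor.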
Avro Data Files
Thanks to its optimized binary encoding, the Avro file format stores data efficiently. It is widely supported within the Hadoop community and beyond.
The Avro file format is ideal for long-term storage of important data. It can be read and written in many languages, such as Java and Scala.
Schema metadata is embedded in the file, which ensures that the file remains readable at all times. Schema evolution lets the format accommodate changes.
In Hadoop, the Avro file format is considered the best choice for general-purpose storage.
Parquet File Format:
Parquet is a columnar format developed jointly by Twitter and Cloudera. It is supported in Spark, MapReduce, Hive, Pig, Impala, Crunch, and other tools. As with Avro, schema metadata is embedded in the file.
The Parquet file format uses advanced optimizations described in Google's Dremel paper. These optimizations reduce storage space and improve performance.
Parquet is most efficient when many records are added at a time, because some of its optimizations rely on identifying repeated patterns.
Data File Formats in Apache Spark
Apache Spark is a cluster computing system that can run on Hadoop and handles different types of data. It is a one-stop solution to many problems, because Spark has a wealth of data handling tools and, most importantly, it is 10 to 20 times faster than Hadoop's MapReduce. It achieves this speed through in-memory processing: the data is held in memory (RAM), and all computations are performed there.
Apache Spark supports a wide range of data formats, including the popular CSV format and the web-friendly JSON format. Apache Parquet and Apache Avro are common formats used primarily for large-scale data analysis.
Let us look at the properties of these Spark-supported file formats:
- CSV
- JSON
- Parquet
- Avro
CSV in Apache Spark
CSV (comma-separated values) files are usually used to exchange tabular data between systems as plain text. CSV is a row-based file format, meaning that every row of the file is a row in the table. In general, a CSV file includes a header row containing the column names; otherwise, the file is considered partially structured. CSV files cannot natively represent hierarchical or relational data. Relationships between data are typically expressed by using multiple CSV files. Foreign keys are stored in columns of one or more files, but the format itself does not express the relations between those files. Additionally, the CSV format is not fully standardized, and files can use separators other than commas, such as tabs or spaces.
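The way relationships live outside the format can be shown with two small CSV documents joined by hand. This is a minimal sketch, and the file contents and column names (`customer_id`, `order_id`, `total`) are made up for the example.

```python
import csv
import io

# Two separate CSV "files"; customer_id in orders acts as a foreign key.
customers_csv = "customer_id,name\n1,Ada\n2,Linus\n"
orders_csv = "order_id,customer_id,total\n100,1,25.50\n101,2,10.00\n102,1,7.25\n"

# Build a lookup table from the first file.
customers = {
    row["customer_id"]: row["name"]
    for row in csv.DictReader(io.StringIO(customers_csv))
}

# The join has to be done by the reader; CSV itself knows nothing about it.
joined = [
    (order["order_id"], customers[order["customer_id"]], order["total"])
    for order in csv.DictReader(io.StringIO(orders_csv))
]
print(joined)
```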
CSV files are splittable only when the file is raw and uncompressed, or when a splittable compression format such as bzip2 or LZO is used (note: LZO needs to be indexed to be splittable).
Advantages of CSV:
- CSV is human-readable and easy to edit by hand.
- CSV offers a straightforward schema.
- Virtually all existing applications can process CSV.
- CSV is easy to implement and parse.
- CSV is compact. In XML, you open and close a tag for every column in every row. In CSV, the column headers are written only once.
Disadvantages of CSV:
- CSV handles only flat data. Complex data structures must be handled separately from the format, and column types are not supported.
- There is no difference between text and numeric columns.
- There is no standard way of representing binary data, and CSV imports can be problematic (for example, there is no difference between NULL and an empty quoted string).
- Support for special characters is weak.
- There is no universal standard.
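The loss of type information and the NULL ambiguity noted above can be reproduced with Python's standard `csv` module. The sample row is an assumption chosen to show both effects.

```python
import csv
import io

out = io.StringIO()
# An integer, a string that looks like a number, a missing value, an empty string.
csv.writer(out).writerow([42, "42", None, ""])
line = out.getvalue()

# Reading the line back yields only strings.
parsed = next(csv.reader(io.StringIO(line)))

print(repr(line))  # '42,42,,\r\n'
print(parsed)      # ['42', '42', '', '']
```

After the round trip, the integer and the string are indistinguishable, and so are the missing value and the empty string.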
Despite these limitations and problems, CSV files are a common choice for data sharing, because they are supported by a wide variety of business, consumer, and scientific applications. Similarly, most batch and streaming frameworks (e.g. Spark and MapReduce) natively support serialization and deserialization of CSV files, and provide ways to add a schema while reading.
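Applying a schema while reading can be sketched without any framework. This is not Spark's API. The `schema` mapping and the `read_with_schema` helper are hypothetical names used only to show the idea of casting columns at read time.

```python
import csv
import io

# Hypothetical schema: column name -> function that converts the raw string.
schema = {"id": int, "price": float, "name": str}


def read_with_schema(text, schema):
    """Parse CSV text and cast each column according to the schema."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {column: cast(row[column]) for column, cast in schema.items()}
        for row in reader
    ]


rows = read_with_schema("id,price,name\n1,9.5,pen\n2,12.0,book\n", schema)
print(rows)
```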
JSON in Apache Spark
Because a lot of data is already transmitted in JSON format, most web languages support JSON natively. With this broad support, JSON is used for representing data structures, as an exchange format for hot data, and for cold data stores.
Many streaming packages and modules support JSON serialization and deserialization. The data contained in JSON documents can eventually be stored in more performance-optimized formats such as Parquet or Avro. When needed, the JSON documents serve as the raw data.
Advantages of JSON:
- JSON supports hierarchical structures, which simplifies storing related data in a single document and representing complex relationships.
- Most languages provide simple JSON serialization libraries or built-in support for JSON serialization and deserialization.
- JSON supports lists of objects, which helps avoid awkward conversions of lists to a relational data model.
Disadvantages of JSON:
- There is no support for namespaces, and therefore extensibility is low.
- Support from development tools is limited. It provides support for the interpretation of formal grammar.
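The hierarchical structures and object lists described above look like this in practice. The sketch uses Python's built-in `json` module, and the order document is invented for the example.

```python
import json

# One document holds the order, its customer, and a list of item objects.
order = {
    "order_id": 100,
    "customer": {"name": "Ada", "city": "London"},
    "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B7", "qty": 1},
    ],
}

text = json.dumps(order)     # serialize to a JSON string
restored = json.loads(text)  # deserialize back to Python objects

total_qty = sum(item["qty"] for item in restored["items"])
print(total_qty)  # 3
```

In a flat format such as CSV, the same data would need separate customer and item tables.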
Parquet file format in Apache Spark:
Since the data is stored in columns, it can be highly compressed (compression algorithms work better on data with low information entropy, which is what a column usually contains) and it can be split. The developers of the format say this storage layout is well suited to Big Data problems.
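To see why column-wise storage compresses well, the sketch below applies run-length encoding, one simple compression scheme, to the same data laid out by column and by row. The sample rows and the `run_length_encode` helper are assumptions for illustration, not Parquet's actual encoder.

```python
from itertools import groupby

rows = [
    ("US", "active"),
    ("US", "active"),
    ("US", "inactive"),
    ("DE", "active"),
    ("DE", "active"),
]


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Column layout: all countries are stored next to each other.
country_column = [country for country, _ in rows]
# Row layout: the fields of each row are interleaved.
row_stream = [field for row in rows for field in row]

encoded_column = run_length_encode(country_column)
encoded_rows = run_length_encode(row_stream)

print(encoded_column)     # [('US', 3), ('DE', 2)]
print(len(encoded_rows))  # 10
```

The column collapses to two runs, while the interleaved row stream has no repeated neighbours and does not shrink at all.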
Unlike CSV and JSON, Parquet files are binary files that contain metadata about their contents. Spark can therefore rely on the metadata alone to determine column names, compression or encoding, data types, and even some basic statistics, without reading or parsing the contents of the file(s). The column metadata of a Parquet file is stored at the end of the file, which allows quick, single-pass writing.
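The benefit of keeping metadata at the end of the file can be sketched with a toy format. This is a simplification and not the real Parquet layout. The `DEMO` magic bytes, the JSON encoding, and the helpers `write_file` and `read_metadata` are all assumptions for the example.

```python
import io
import json
import struct

MAGIC = b"DEMO"  # hypothetical 4-byte marker that ends the file


def write_file(stream, columns):
    """Write each column once, then a footer describing where it landed."""
    metadata = {}
    for name, values in columns.items():
        payload = json.dumps(values).encode("utf-8")
        metadata[name] = {
            "offset": stream.tell(),
            "length": len(payload),
            "count": len(values),
        }
        stream.write(payload)
    footer = json.dumps(metadata).encode("utf-8")
    stream.write(footer)
    stream.write(struct.pack("<I", len(footer)))  # footer length
    stream.write(MAGIC)


def read_metadata(stream):
    """Read only the footer, without touching the column data."""
    stream.seek(-8, io.SEEK_END)
    footer_len = struct.unpack("<I", stream.read(4))[0]
    if stream.read(4) != MAGIC:
        raise ValueError("not a demo file")
    stream.seek(-(8 + footer_len), io.SEEK_END)
    return json.loads(stream.read(footer_len))


buffer = io.BytesIO()
write_file(buffer, {"id": [1, 2, 3], "name": ["a", "b", "c"]})
metadata = read_metadata(buffer)
print(sorted(metadata))
```

The writer never has to go back and patch earlier bytes, because offsets and counts are only known, and only written, after the data itself.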
Parquet is optimized for the Write Once Read Many (WORM) model. It writes slowly but reads extremely quickly, particularly when only a subset of columns is accessed. Parquet is a good choice for read-heavy workloads that access only parts of the data. In cases where you need to work with entire rows of data, a format such as CSV or Avro should be used.
Advantages of parquet data storage:
- Parquet is a columnar format. Only the required columns are retrieved or read, which reduces disk I/O. This concept is called projection pushdown.
- The schema travels with the data, so the data is self-describing.
- Parquet files are just files, which means that they can be easily accessed, transferred, backed up, and replicated.
- Built-in support in Spark makes it simple to read and write the files.
Disadvantages of Parquet data storage:
- The column-based design makes you think about the schema and data types.
- Parquet does not always have built-in support in tools other than Spark.
- Parquet does not support data modification (Parquet files are immutable) or schema evolution. Spark does know how to merge schemas that change over time (you must set a special option when reading), but you can only change something in an existing file by overwriting it.
Predicate pushdown or filter pushdown
The basic idea of predicate pushdown is that some parts of a query (the predicates) can be "pushed" to where the data is stored. For example, if we specify filtering criteria, the data store tries to filter out records at the time of reading. The advantage of predicate pushdown is that there are fewer disk I/O operations and hence overall performance is higher. Otherwise, all the data would have to be loaded into memory and then filtered, resulting in higher memory requirements.
By filtering the data earlier rather than later, this optimization can significantly reduce query and processing time. Depending on the processing system, predicate pushdown can optimize a query by filtering data before it is transmitted over the network, filtering data before it is loaded into memory, or skipping entire files or chunks of files.
Most DBMSs, as well as Big Data storage formats such as Parquet and ORC, implement this concept.
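One common way to skip chunks is to keep minimum and maximum values per chunk and compare them with the predicate. The sketch below is a simplified model of that idea. The chunk layout and the `scan_greater_than` helper are assumptions, not the API of any particular storage engine.

```python
# Each chunk carries min/max statistics next to its values.
chunks = [
    {"min": 1, "max": 10, "values": [1, 4, 10]},
    {"min": 11, "max": 20, "values": [11, 15, 20]},
    {"min": 21, "max": 30, "values": [21, 27, 30]},
]


def scan_greater_than(chunks, threshold):
    """Return values above the threshold and the number of chunks read."""
    matches = []
    chunks_read = 0
    for chunk in chunks:
        if chunk["max"] <= threshold:
            continue  # the statistics prove no value can match: skip the chunk
        chunks_read += 1
        matches.extend(v for v in chunk["values"] if v > threshold)
    return matches, chunks_read


matches, chunks_read = scan_greater_than(chunks, 18)
print(matches)      # [20, 21, 27, 30]
print(chunks_read)  # 2
```

The first chunk is never read, because its maximum value already rules it out.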
Projection pushdown
When data is read from storage, only the required columns are read, not all fields. This principle is typically followed by columnar formats such as Parquet and ORC, resulting in better I/O performance.
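Reading only the required columns can be modelled with a dictionary of columns. This is a toy in-memory model, and the `column_store` contents and the `read_columns` helper are invented for the example.

```python
# Column-oriented storage: each column is kept as its own list.
column_store = {
    "id": [1, 2, 3],
    "name": ["Ada", "Linus", "Grace"],
    "bio": ["long text 1", "long text 2", "long text 3"],
}


def read_columns(store, wanted):
    """Fetch only the requested columns; the others are never touched."""
    return {name: store[name] for name in wanted}


projected = read_columns(column_store, ["id", "name"])
rows = list(zip(projected["id"], projected["name"]))
print(rows)  # [(1, 'Ada'), (2, 'Linus'), (3, 'Grace')]
```

In a row-based file, the large `bio` field would have to be read and discarded for every row.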
Avro in Apache Spark:
Apache Avro was released by the Hadoop working group. It is a row-based format that is highly splittable. It is also described as a data serialization system similar to Java Serialization. The schema is stored in JSON format, while the data is stored in binary format, which reduces file size and improves performance. Avro has reliable support for schema evolution through its handling of added, missing, and changed fields. This allows old software to read new data and new software to read old data, which is a critical feature if your data is likely to change.
Avro's ability to handle schema evolution allows components to be updated independently, at different times, with low risk of incompatibility. It eliminates the need for developers to write if-else statements to handle different schema versions, and removes the need to look at old code to understand an old schema. Since all versions of the schema are stored in a human-readable JSON header, it is easy to understand all the fields available to you.
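The way added and missing fields are reconciled can be sketched in plain Python. This does not use the Avro library. The `resolve` helper and the sample schemas are assumptions that mimic the general rule: fields unknown to the reader are ignored, and fields missing from the data take the reader's default.

```python
def resolve(record, reader_fields):
    """Project a written record onto the reader's schema."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]  # field added after the data was written
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved  # fields the reader does not know are simply dropped


old_reader = [{"name": "id"}, {"name": "name"}]
new_reader = [{"name": "id"}, {"name": "name"}, {"name": "email", "default": None}]

old_record = {"id": 1, "name": "Ada"}
new_record = {"id": 2, "name": "Linus", "email": "linus@example.org"}

# New software reads old data: the missing field gets its default.
new_reads_old = resolve(old_record, new_reader)
# Old software reads new data: the unknown field is ignored.
old_reads_new = resolve(new_record, old_reader)

print(new_reads_old)
print(old_reads_new)
```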
Avro is a relatively compact choice for both permanent storage and wire transfer, since the schema is stored in JSON and the data is stored in binary form. Since Avro is a row-based format, it is the preferred format for handling large numbers of records, as new rows can be added easily.
Advantages of Avro:
- Avro is a language-neutral data serialization format.
- Avro stores the schema in the file header, so the data is self-describing.
- Data serialization and deserialization are simple and quick, which can provide very good ingestion performance.
- As with sequence files, Avro files include synchronization markers to separate blocks. This makes them highly splittable.
- Avro files are splittable and compressible, and are therefore a good candidate for data storage in the Hadoop ecosystem.
- The schema used to read Avro files does not have to be the same as the one used to write them, which allows new fields to be added independently.
Disadvantages of Avro:
- In C#, .NET 4.5 must be used to make the best use of Avro.
- Serialization is potentially slower.
- A schema is needed to read or write data.
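The role of synchronization markers in splitting a file can be illustrated with a toy byte stream. This is a simplified model and not Avro's actual container format. The fixed marker value and the `blocks_from` helper are assumptions for the example.

```python
# A 16-byte marker that never occurs inside the sample blocks (hypothetical fixed value).
SYNC = bytes(range(240, 256))

blocks = [b"row1\nrow2\n", b"row3\nrow4\n", b"row5\n"]
data = b"".join(block + SYNC for block in blocks)


def blocks_from(data, offset):
    """Start at an arbitrary byte offset and return the complete blocks after it."""
    start = data.find(SYNC, offset)  # scan forward to the next marker
    if start == -1:
        return []
    remainder = data[start + len(SYNC):]
    return [block for block in remainder.split(SYNC) if block]


# A reader dropped into the middle of the first block can still align itself.
recovered = blocks_from(data, 3)
print(recovered)
```

A worker assigned a split that begins mid-block only has to scan to the next marker to find a clean block boundary.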
I hope this helps you reach a conclusion about which data file format to use in Apache Spark.