Hadoop Application Architectures
Book description: Get expert guidance on architecting end-to-end data management solutions with Apache Hadoop.

[Figure: An example of a SequenceFile using block compression]

Serialization Formats

Serialization refers to the process of turning data structures into byte streams, either for storage or for transmission over a network.

Conversely, deserialization is the process of converting a byte stream back into data structures. Serialization is core to a distributed processing system such as Hadoop, since it allows data to be converted into a format that can be efficiently stored as well as transferred across a network connection. The main serialization format utilized by Hadoop is Writables. Writables are compact and fast, but not easy to extend or use from languages other than Java. Other serialization frameworks seeing increased use within the Hadoop ecosystem include Thrift, Protocol Buffers, and Avro. Of these, Avro is the best suited, since it was specifically created as a replacement for Hadoop Writables.
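To make this concrete, here is a minimal sketch of a custom Writable; the StockRecord class and its fields are hypothetical, not from the book, but they illustrate the write()/readFields() contract that every Writable implements.

```java
// A minimal custom Writable sketch (hypothetical StockRecord type),
// using the standard org.apache.hadoop.io API.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class StockRecord implements Writable {
    private Text symbol = new Text();   // illustrative field names
    private double closingPrice;

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize fields in a fixed order; readFields must mirror this order exactly.
        symbol.write(out);
        out.writeDouble(closingPrice);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        symbol.readFields(in);
        closingPrice = in.readDouble();
    }
}
```

Because the byte layout exists only in Java code like this, data written as Writables is hard to consume from other languages, which is the portability gap the formats below try to close.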

Thrift

Thrift was developed at Facebook as a framework for implementing cross-language interfaces to services. As a storage format, Thrift has some drawbacks: it does not support internal compression of records, is not splittable, and has no native MapReduce support. Note that there are externally available libraries, such as the Elephant Bird project, that address these drawbacks, but Hadoop does not provide native support for Thrift as a data storage format.

Protocol Buffers

The Protocol Buffers (protobuf) format was developed at Google. Like Thrift, protobuf structures are defined using an IDL, which is used to generate stub code for multiple languages. Also like Thrift, Protocol Buffers do not support internal compression of records, are not splittable, and have no native MapReduce support.

But also like Thrift, the Elephant Bird project can be used to encode protobuf records, providing support for MapReduce, compression, and splittability.

Avro

Avro is a language-neutral data serialization system designed to address the major downside of Hadoop Writables: lack of language portability. Like Thrift and Protocol Buffers, Avro data is described using a language-independent schema. Unlike Thrift and Protocol Buffers, code generation is optional with Avro.

Avro also provides better native support for MapReduce, since Avro data files are compressible and splittable. Another important feature of Avro is support for schema evolution: the schema used to read a file does not need to match the schema used to write it. This makes it possible to add new fields to a schema as requirements change. The schema is stored as part of the file metadata in the file header. In addition to metadata, the file header contains a unique sync marker. Just as with SequenceFiles, this sync marker is used to separate blocks in the file, allowing Avro files to be splittable.

Following the header, an Avro file contains a series of blocks containing serialized Avro objects. These blocks can optionally be compressed, and within those blocks, types are stored in their native format, providing an additional boost to compression.
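As an illustration of these ideas, the sketch below uses the Avro Java API (org.apache.avro) to write and then read a data file of generic records. The User schema, field names, and file path are made up for this example; the Snappy codec assumes the snappy-java library is on the classpath (deflate could be substituted).

```java
// Sketch: writing and reading an Avro data file with the Avro Java API.
// The "User" schema and the values below are illustrative, not from the book.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // Language-independent schema; code generation is optional, so GenericRecord is used.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        File file = new File("users.avro");

        // The writer embeds the schema and a sync marker in the file header,
        // then appends (optionally compressed) blocks of serialized records.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.setCodec(CodecFactory.snappyCodec());  // requires snappy-java
            writer.create(schema, file);
            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "alice");
            user.put("age", 42);
            writer.append(user);
        }

        // The reader obtains the writer's schema from the file header.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec);
            }
        }
    }
}
```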

Avro defines a small number of primitive types such as boolean, int, float, and string, and also supports complex types such as array, map, and enum.

Columnar Formats

Until relatively recently, most database systems stored records in a row-oriented fashion. This is efficient for cases where many columns of the record need to be fetched. For example, if your analysis heavily relied on fetching all fields for records that belonged to a particular time range, row-oriented storage would make sense.

Row-oriented storage can also be more efficient when writing data, particularly if all columns of the record are available at write time, since the record can be written with a single disk seek. Columnar storage, in contrast, works well for queries that only access a small subset of columns; if many columns are being accessed, then row-oriented storage is generally preferable.

Compression on columns is generally very efficient, particularly if the column has few distinct values. Columnar storage is often well suited for data-warehousing-type applications where users want to aggregate certain columns over a large collection of records.

Not surprisingly, columnar file formats are also being utilized for Hadoop applications. Columnar file formats supported on Hadoop include the RCFile format, which has been popular for some time as a Hive format, as well as newer formats such as ORC and Parquet.

RCFiles are similar to SequenceFiles, except data is stored in a column-oriented fashion. The RCFile format breaks files into row splits, then within each split uses column-oriented storage.

The RCFile format has some deficiencies that prevent optimal performance and compression. Newer columnar formats such as ORC and Parquet address many of these deficiencies, and for most newer applications they will likely replace RCFile, although RCFile is still a fairly common format used with Hive storage.

ORC

The ORC format allows predicates to be pushed down to the storage layer so that only required data is brought back in queries. It supports the Hive type model, including new primitives such as decimal as well as complex types, and it is a splittable storage format. A drawback of ORC as of this writing is that it was designed specifically for Hive, and so is not a general-purpose storage format that can be used with non-Hive MapReduce interfaces such as Pig or Java, or other query engines such as Impala.

Work is underway to address these shortcomings, though.

Parquet

Parquet shares many of the same design goals as ORC, but is intended to be a general-purpose storage format for Hadoop.

Parquet is designed to support complex nested data structures, and it stores full metadata at the end of its files, so Parquet files are self-documenting.
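As a sketch of how Parquet files are commonly produced from Java, the example below uses the parquet-avro integration (AvroParquetWriter); the schema, file name, and codec choice are illustrative, and the exact builder methods can vary between Parquet versions.

```java
// Sketch: writing a Parquet file from Avro records via the parquet-avro module.
// Assumes parquet-avro and hadoop-client are on the classpath; names are illustrative.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path("users.parquet"))
                     .withSchema(schema)
                     .withCompressionCodec(CompressionCodecName.SNAPPY)
                     .build()) {
            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "alice");
            user.put("age", 42);
            writer.write(user);  // column metadata/footer is written when the writer closes
        }
    }
}
```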

Comparing Failure Behavior for Different File Formats

An important aspect of the various file formats is failure handling; some formats handle corruption better than others. Columnar formats, while often efficient, do not work well in the event of failure, since this can lead to incomplete rows.

Sequence files will be readable up to the first failed row, but will not be recoverable after that row. Avro provides the best failure handling: in the event of a bad record, the read will continue at the next sync point, so failures only affect a portion of a file.

Compression

Compression is another important consideration for storing data in Hadoop, not just in terms of reducing storage requirements, but also to improve the performance of data processing.

This includes compression of source data, but also the intermediate data generated as part of data processing, e.g., the output of MapReduce jobs. Since the MapReduce framework splits data for input to multiple tasks, a non-splittable compression codec is an impediment to efficient processing.

If files cannot be split, the entire file needs to be passed to a single MapReduce task, eliminating the advantages of parallelism and data locality that Hadoop provides.
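The sketch below illustrates this concern by asking Hadoop which codec, if any, applies to an input file and whether that codec supports splitting; the file path is hypothetical, and the logic is a simplified version of the check that text-based input formats perform internally.

```java
// Sketch: checking whether an input file's compression codec supports splitting,
// using Hadoop's CompressionCodecFactory. The path is illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // The codec is inferred from the file extension (.gz, .bz2, .snappy, ...).
        Path input = new Path("/data/events.log.gz");
        CompressionCodec codec = factory.getCodec(input);

        if (codec == null) {
            System.out.println("Uncompressed: each HDFS block can become its own split.");
        } else if (codec instanceof SplittableCompressionCodec) {
            // e.g., bzip2: MapReduce can start map tasks at block boundaries.
            System.out.println(codec.getClass().getSimpleName() + " is splittable.");
        } else {
            // e.g., gzip or raw Snappy: the whole file goes to a single map task.
            System.out.println(codec.getClass().getSimpleName() + " is NOT splittable.");
        }
    }
}
```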

Snappy

Snappy is a compression codec developed at Google that provides high compression speeds with reasonable compression ratios. Processing performance with Snappy can be significantly better than with other compression formats.
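As a sketch of how Snappy is typically enabled in practice, the example below compresses intermediate map output with Snappy and writes the final output as a block-compressed SequenceFile so the result remains splittable. The job name and configuration details are illustrative, and native Snappy libraries must be installed on the cluster.

```java
// Sketch: enabling Snappy compression for a MapReduce job. Mapper, reducer,
// and input/output paths are omitted; names here are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SnappyJobSetup {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();

        // Compress intermediate map output: cheap on CPU, saves shuffle I/O.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "snappy-example");

        // Raw Snappy files are not splittable, so write the final output into a
        // container format (block-compressed SequenceFile) that remains splittable.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);

        return job;
    }
}
```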
