RDD : Resilient Distributed Dataset

A key feature of Apache Spark is the Resilient Distributed Dataset (RDD), the core data structure available in Spark.

RDDs are fault tolerant and can be operated on in parallel.

RDDs can be created by:

  • Parallelizing an existing collection in the driver program.
  • Referencing a dataset in an external storage system such as HDFS, HBase or any data source supporting a Hadoop input format.

RDDs support two types of operations:

  1. Transformations
  2. Actions

Transformations generate a new dataset from an existing dataset.

Actions return a value to the driver program after running a computation on the dataset.

Unlike MapReduce, Spark does not run each processing step to completion as soon as it is defined. Transformations are lazy: Spark only records them.

This makes Spark efficient, because it operates on a dataset only when an action requires a result to be returned to the driver program.
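The difference between transformations and actions can be illustrated with a toy model in plain Python. This is only a sketch of the idea and not Spark's implementation: the `LazyDataset` class and the `square` helper are hypothetical names made up for this example, loosely mirroring the `map`, `filter`, `collect` and `count` operations that RDDs offer.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source  # the underlying collection
        self._steps = steps    # recorded transformations, not yet applied

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        # Chain the recorded steps over the source as lazy iterators.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return a value to the caller.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every element that square() actually processes


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset([1, 2, 3, 4, 5])
even_squares = numbers.map(square).filter(lambda v: v % 2 == 0)

calls_before_action = list(calls)  # still empty: nothing has run yet
result = even_squares.collect()    # the action triggers the work

print(calls_before_action)  # []
print(result)               # [4, 16]
```

Building `even_squares` does not call `square` at all; the work happens only when `collect()` asks for a result, which is the behaviour described above for the driver program.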

CAP Theorem and Distributed System

CAP Theorem is also known as Brewer’s Theorem.

CAP refers to Consistency, Availability and Partition tolerance.

  • Consistency – All nodes in a cluster see the same data at any particular time.
  • Availability – Every request to a working node receives a response, without a guarantee that it reflects the most recent write.
  • Partition tolerance – The system continues to operate even when network failures cut off communication between some of its nodes.

According to the CAP Theorem, a distributed system can guarantee at most 2 of the above 3 properties at the same time.

Most distributed systems experience network failures, so partition tolerance has to be supported.

That leaves a choice between the remaining two properties:

Databases designed around ACID properties choose consistency over availability.

Note: Consistency here is not to be confused with the "C" in ACID.

On the other hand, databases designed around BASE properties choose availability over consistency.
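The trade-off can be sketched with two toy replicas in Python. This is a simplified illustration and not how any real database is implemented: the `Replica` class, the `"CP"` and `"AP"` mode labels and the `run_scenario` helper are all assumptions made up for this example.

```python
class Unavailable(Exception):
    """Raised when a replica refuses a request in order to stay consistent."""


class Replica:
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP" (labels assumed for this sketch)
        self.value = None
        self.peer = None
        self.partitioned = False  # True when cut off from its peer

    def write(self, value):
        if self.partitioned:
            if self.mode == "CP":
                raise Unavailable("cannot reach peer, refusing write")
            self.value = value    # AP: accept locally, the peer becomes stale
            return
        self.value = value
        self.peer.value = value   # replicate while the network is healthy

    def read(self):
        if self.partitioned and self.mode == "CP":
            raise Unavailable("cannot confirm latest value, refusing read")
        return self.value


def run_scenario(mode):
    a, b = Replica(mode), Replica(mode)
    a.peer, b.peer = b, a
    a.write("v1")                          # replicated to both nodes
    a.partitioned = b.partitioned = True   # network partition begins
    try:
        a.write("v2")
        write_ok = True
    except Unavailable:
        write_ok = False
    try:
        seen = (a.read(), b.read())
    except Unavailable:
        seen = None
    return write_ok, seen


cp_result = run_scenario("CP")
ap_result = run_scenario("AP")

print(cp_result)  # (False, None)
print(ap_result)  # (True, ('v2', 'v1'))
```

During the partition, the "CP" pair refuses requests rather than risk disagreeing, while the "AP" pair keeps answering but the two nodes return different values.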

HBase

HBase is a NoSQL database. In CAP terms, HBase favors consistency and partition tolerance over availability, so parts of the data can be temporarily unavailable during a failure.

Google Analytics for Jan-Feb-March 2018 (Q4)

The statistics for MOHANMA.COM for Jan-Feb-March (26th January to 31st March) are as shown above.

There have been a total of 737 sessions with 1673 pageviews.

The Bounce Rate of the website is 66.76%, which is pretty good: the lower the bounce rate, the better the website is performing.

New visitors account for 76.5% of total visitors and returning visitors account for 23.5%.

India accounts for the most traffic, with Karnataka being the leading region at 576 sessions so far.

Among devices, mobile phones generate the most traffic, followed by desktops and lastly tablets.

Apple iPhones are the leading mobile phones used to access the website, followed by Xiaomi and Motorola phones.

Thanks to everyone who has been visiting the website.

Love,

Mohan M A

Data: The new “OIL” of the Information Era

Anything that can be produced and quantified is referred to as data. In simple words, data is granular information. Automobiles and machinery have been running on oil extracted from the earth. In the Internet age, devices, machines and even mundane everyday activities will be driven by data.

All data that exists can be classified into three forms:

  1. Structured Data
  2. Semi-structured Data
  3. Unstructured Data

Structured Data – Structured data follows a standardized, predefined format. Examples: Sensor data, machine generated data, etc.

Semi-structured Data – Semi-structured data does not follow a fixed standard form but can be converted to a structured form with little effort. Examples: JSON, XML, etc.

Unstructured Data – Unstructured data is any data that is not organized in a predefined way but may contain information which can be extracted. Examples: Social media data, human languages, etc.
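As a small illustration of deriving a structured form from semi-structured data, the Python sketch below flattens JSON records into fixed-column rows. The records, the column names and the `to_rows` helper are hypothetical and exist only for this example.

```python
import json

# Hypothetical semi-structured input: records do not all share the same fields.
raw = """
[
  {"user": "asha", "device": "mobile", "pages": 3},
  {"user": "ravi", "device": "desktop"},
  {"user": "meera", "pages": 1, "country": "IN"}
]
"""

COLUMNS = ("user", "device", "pages")  # the structured schema we want


def to_rows(json_text, columns=COLUMNS):
    """Turn a JSON array of objects into rows with one value per column."""
    records = json.loads(json_text)
    # Missing fields become None; fields outside the schema are dropped.
    return [tuple(record.get(col) for col in columns) for record in records]


rows = to_rows(raw)
for row in rows:
    print(row)
```

Each row now has the same three columns, so it could be loaded into a table, while the original JSON records were free to omit or add fields.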

Most global tech giants run on data generated by their users. Google, Amazon, Facebook, Uber, et al. come under the same umbrella. The insights derived from structured and semi-structured data can help us in decision making. The magnitude and scale at which these companies generate data is astounding.

Databases play a very important role in storing data. But traditional databases alone are often no longer sufficient for storing data in today’s fast-moving world. New-age file systems and infrastructure have emerged to cater to the demands of the ever-expanding Internet space.

In the human world, voice, speech, text, even walking speed can all be classified as unstructured data from which we can derive a lot of insights. A mobile device per individual is pretty much sufficient to analyse the behavior of a sizable population in a region.

Data collected from a population over a considerable period of time can be used to derive patterns about that population. Hence, data is the driving force which will fuel innovation and the economy from here on.

Google Analytics for MOHANMA.COM: Launch (26th Jan) to 2nd March

Firstly, thanks to everyone who has visited mohanma.com. The above image shows the global footprint of mohanma.com.

The total number of sessions so far has been 563. India leads in web traffic with 493 sessions, followed by the United States of America, Australia and the United Kingdom. Traffic has been flowing in from different parts of Europe, Canada, Peru, China, Japan, etc. as well.

To be frank, I have been surprised by the kind of response that mohanma.com has been receiving from day 1. Thanks to all the readers and well-wishers who have been pouring in their feedback via WhatsApp, Facebook and the comments section on the website.

Love,

Mohan M A