Understanding Apache Spark

In my last blog post I discussed data. Now let us look at a modern tool for processing huge datasets (Big Data) so as to extract insights from them.

Apache Spark – A fast and general engine for large-scale data processing. Spark is a more sophisticated data processing engine than those built on the MapReduce model.

One of the key features of Apache Spark is the Resilient Distributed Dataset (RDD). RDDs are data structures available in Spark.

Spark can run on a Hadoop YARN cluster. Its biggest advantage over MapReduce is the ability to keep large datasets in memory.

Application types that find Spark’s processing model helpful are:

  1. Iterative algorithms
  2. Interactive analysis

Other areas that make Spark easier to adopt are:

Spark DAG (Directed Acyclic Graph) – This component of the engine helps convert a variable number of operations into a single job.

User Experience – Spark makes the user experience smooth by offering a plethora of APIs for data processing tasks.

Spark has APIs in these languages: Scala, Java, Python and R.

The Spark shell (also known as the Spark CLI or Spark REPL) makes it simple to work with datasets interactively. REPL stands for read-eval-print loop.
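The DAG idea mentioned above can be pictured with a rough analogy. The sketch below uses plain Java streams from the JDK, not the Spark API, and the class and method names are made up for illustration. It only shows the general pattern of chaining operations first and running them together when a result is finally requested.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: plain Java streams, NOT the Spark API.
class LazyPipelineDemo {

    // Records every element that the filter step actually inspects.
    static final List<Integer> inspected = new ArrayList<>();

    // Chains two operations; nothing runs until a terminal operation is called.
    static Stream<Integer> buildPipeline(List<Integer> input) {
        return input.stream()
                .filter(n -> { inspected.add(n); return n % 2 == 0; })
                .map(n -> n * n);
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = buildPipeline(Arrays.asList(1, 2, 3, 4));
        System.out.println("Inspected before collect: " + inspected.size()); // 0
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("Result: " + result); // [4, 16]
    }
}
```

The filter and map steps are only described at first. They execute in one pass when collect is called, which is loosely similar to how several operations end up in a single job.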

Spark also provides modules for:

  1. Machine learning (MLlib) – Provides a framework for distributed machine learning.
  2. Graph processing (GraphX) – Provides a framework for distributed graph processing.
  3. Stream processing (Spark Streaming) – Helpful for streaming (near real-time) analytics. Data is ingested in mini-batches, and RDD transformations are performed on these mini-batches.
  4. SQL (Spark SQL) – Provides a data abstraction known as SchemaRDD (renamed DataFrame in later Spark releases), which supports structured and semi-structured data.

These components operate on Spark Core. Spark Core provides the platform for in-memory computing and for referencing datasets in external storage systems.

Companies that are using Apache Spark – Google, Facebook, Twitter, Amazon, Oracle, et al.

Spark services are provided on notable cloud platforms such as Google Cloud Platform (GCP), Amazon Web Services (AWS) and Microsoft Azure.

Source: Apache Spark

 

Data: The new “OIL” of Information Era

Anything that can be produced and quantified is referred to as data. In simple words, granular information is data. Automobiles and machinery have been running on oil extracted from the earth. In the Internet age, devices, machines and all mundane activities shall be driven by data.

All data that exists can be classified into three forms:

  1. Structured Data
  2. Semi-structured Data
  3. Unstructured Data

Structured Data – Structured data follows a standardized format for providing information. Examples: sensor data, machine-generated data, etc.

Semi-structured Data – Semi-structured data does not exist in a standard form but can be converted into a structured form with little effort. Examples: JSON, XML, etc.

Unstructured Data – Unstructured data is any data that is not organized but may contain information that can be extracted. Examples: social media data, human languages, etc.

Most global tech giants run on data generated by their users. Google, Amazon, Facebook, Uber, et al. come under the same umbrella. The insights derived from structured and semi-structured data can help us in decision making. The magnitude and scale at which these companies generate data is astounding.

Databases play a very important role in storing data. But traditional databases are often no longer the first choice for storing data in today’s fast-moving world. New-age file systems and infrastructure have emerged to cater to the demands of the ever-expanding Internet space.

In the human world, voice, speech, text, walking speed and much else can be classified as unstructured data, and we can derive a lot of insights from them. One mobile device per individual is pretty much sufficient to analyse the behavior of a sizable population in a region.

Data collected from a population over a considerable period of time can be used to derive patterns about that population. Hence, data is the driving force that will fuel innovation and the economy from here on.

Bengaluru and the spirit of Entrepreneurship

Here is a list of reasons why Bengaluru is a favorite location for start-ups:

1 – Technology Infrastructure :- High-end systems, high-speed internet: ask for anything new in the field of technology, and it’s usually implemented first in Bengaluru. That earns Bengaluru a premium tag.

2 – Workforce & Talent :- A plethora of highly qualified professionals makes Bengaluru an easy choice for most entrepreneurs to kick-start their venture. Engineers and business professionals are a new class in the neatly woven fabric of Bengaluru.

3 – Geographic location :- The weather in Bengaluru is usually pleasant, with temperatures hovering between 23°C and 27°C throughout the year, summer being the exception. Bengaluru is the king among the metropolitan cities of India when it comes to weather.

4 – Koramangala :- The heart of entrepreneurship for the whole nation is located at Koramangala. Visit any street in this locality of Bengaluru and you shall encounter a new venture or business being built. Koramangala is a place that every entrepreneur must visit at least once in their lifetime.

5 – Government support :- The Karnataka Government has been one of the most proactive governments in the country as far as entrepreneurship is concerned. Some highly successful Information Technology and Bio-technology ventures started off in Bengaluru with the initial support offered by the Karnataka Government. ITPL and the many SEZs in the city are examples of this.

6 – Cosmopolitan Nature :- This is the ultimate reason entrepreneurship is cherished in Bengaluru. People from different parts of the country live and work in Bengaluru; it is surely a “melting pot”.

7 – Cost of living :- Of late, costs have been going up in Bengaluru due to inflation and other factors. But it is still affordable for the middle class. The cost of living is lower than in Delhi and Mumbai.

8 – Education :- The city hosts a series of exceptional institutions, IISc and IIM being the cream. Even at secondary and higher secondary level there are many noteworthy institutions.

9 – Law and Order :- Law & order is usually stable, with a little huff and puff at times. It is one more reason to choose Bengaluru as a business location.

10 – Healthcare :- Many healthcare facilities are spread throughout the city, and there are specialized units that cater to foreigners too. Healthcare is cheaper in Bengaluru compared to international standards.

Bengaluru is the first Indian city to have its own logo.

“BE”ngalur“U”: the BE and U in Bengaluru stand for “Be You”.

Google Analytics for MOHANMA.COM : Launch(26th Jan) to 2nd March

Firstly, thanks to everyone who has visited mohanma.com. The above image shows the global footprint of mohanma.com.

The total number of sessions so far has been 563. India leads in web traffic with 493 sessions, followed by the United States of America, Australia and the United Kingdom. Traffic has also been flowing in from different parts of Europe, Canada, Peru, China, Japan, etc.

To be frank, I have been surprised by the kind of response that mohanma.com has been receiving from day 1. Thanks to all the readers and well-wishers who have been pouring in their feedback via WhatsApp, Facebook and the comments section on the website.

Love,

Mohan M A

Difference between Process and Thread in Java

 

Process | Thread
1) Part of concurrent programming. | 1) Integral part of concurrent programming.
2) A process has a self-contained execution environment. | 2) A thread is sometimes called a lightweight process.
3) A process has its own memory space. | 3) Threads share the process's resources, including memory and open files.
4) Processes are often seen as synonymous with programs or applications. | 4) The Java platform's system threads do things such as memory management and signal handling.
5) A Java application can create additional processes using a ProcessBuilder object. | 5) The main thread has the ability to create additional threads.
6) To facilitate communication between processes, most operating systems support Inter Process Communication (IPC) resources, such as pipes and sockets. | 6) Threads exist within a process. This makes for efficient, but potentially problematic, communication.
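
The point about threads sharing the memory of their process can be tried out with a small sketch. The class and method names here are my own, and AtomicInteger is just one way to keep the shared updates safe; treat it as an illustration rather than the only approach.

```java
import java.util.concurrent.atomic.AtomicInteger;

class SharedCounterDemo {

    // Starts threadCount threads that all add to one counter living in shared memory.
    static int countWithThreads(int threadCount, int incrementsPerThread) {
        AtomicInteger counter = new AtomicInteger(); // one object, visible to every thread
        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    counter.incrementAndGet(); // atomic, so no update is lost
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // wait for every worker to finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for workers", e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        // The main thread creates four additional threads.
        System.out.println(countWithThreads(4, 1000)); // prints 4000
    }
}
```

With a plain int field instead of AtomicInteger, some increments could be lost, which is the "potentially problematic" side of sharing memory between threads.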

Processing time for a single core is shared among processes and threads through an OS feature called time slicing.

Time Slice or Preemption : A technique to implement multitasking in operating systems. The period of time for which a process is allowed to run in a preemptive multitasking system is generally called the time slice or quantum. The scheduler is run once every time slice to choose the next process to run.

Source: 1) The Java™ Tutorials
        2) Wikipedia

Object-Oriented Programming (OOP)

OOP is based on “objects”. An object can perform a set of functions. The functions that the object performs define the object’s behavior.

For example, a Batsman (object) can play shots and a Bowler (object) can bowl.

Languages that support object-oriented programming have four important aspects:

1. Abstraction
2. Encapsulation
3. Inheritance
4. Polymorphism

Abstraction is the process of defining objects that represent abstract actors which can perform work,
report on and change their state, and interact with other objects in the system.

Encapsulation is the process of enclosing data and functions into a single unit.

Inheritance is when an object or class is based on another object or class,
using the parent’s properties to maintain the same behavior in the child.

Polymorphism is the provision of a single interface to objects of different types.
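
Here is a minimal Java sketch of the four aspects, built on the Batsman and Bowler example. The Player base class, the player names and the method names are my own assumptions for illustration.

```java
// Abstraction: callers only need to know that a Player can play.
abstract class Player {
    // Encapsulation: the name is private and reachable only through a method.
    private final String name;

    Player(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    abstract String play();
}

// Inheritance: Batsman and Bowler reuse the state and behavior of Player.
class Batsman extends Player {
    Batsman(String name) {
        super(name);
    }

    @Override
    String play() {
        return getName() + " plays a shot";
    }
}

class Bowler extends Player {
    Bowler(String name) {
        super(name);
    }

    @Override
    String play() {
        return getName() + " bowls";
    }
}

class CricketDemo {
    public static void main(String[] args) {
        // Polymorphism: one Player reference type, two different behaviors.
        Player[] players = { new Batsman("Asha"), new Bowler("Ravi") };
        for (Player player : players) {
            System.out.println(player.play());
        }
    }
}
```

Running CricketDemo prints "Asha plays a shot" followed by "Ravi bowls".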

Object-oriented programming (OOP) is a programming paradigm in which objects contain data, in the form of fields (often called attributes), and code, in the form of methods.

Some of the popular languages that support object-oriented programming:

Java, Python, C++, C#, Perl, Ruby, Objective-C, PHP, et al.

Why I chose India? ‘The perquisite of choosing India over America’

Just like any other kid who completes engineering, I too had plans of pursuing a master’s in America.

I had my GRE and TOEFL test scores, got a few letters of recommendation, and got all my transcripts. I was on the verge of applying to universities in the US.

Initially, I had plans of joining a master’s program in Spring 2012 (as I completed my engineering in 2011). But I postponed the joining date to Fall 2012, as more universities would be offering master’s courses then.

I was fascinated by the idea of studying and working with cutting-edge technologies in the US.

When I started my engineering in 2007, the head of the physics department, Professor H N Sathyamurthy, introduced me to the concept of higher education in the US (before the 2008 recession), where there was a plethora of opportunities if you worked hard. I am thankful to him for inculcating such ideals.

So the thought of going over to the US was a long-held one, given the time between the idea and actually pursuing a master’s.

Fast forward to December 2011: I had two choices, either go over to the US for a master’s or stay back in India with a comfortable lifestyle. The stakes were high if I went over to the US; the opportunity was a gamble.

I chose to stay back in India. I was complacent initially, for not chasing the American dream.

But 3-6 years down the line I have realized the essence and advantage of staying back in India.

The biggest advantage of staying back has been that you can be there for your family and vice versa.

A lot of technological advancements that used to be limited to the US and other developed countries have started making their presence felt in India as well. Cutting-edge technology and premier research work are now available in different metro cities of India. Maybe not full-fledged, but they have started gaining decent visibility.

Given today’s political climate and visa norms in the US, I find it was a very good decision to stay back in India.

Also, the kind of tech atmosphere that India is offering today can be considered among the best in the world.

When it comes to Internet startups, the US has started reaching its saturation point. India offers a level playing field and equal opportunity, with net neutrality, for tech startups.

The crowd in India is open to new ideas, which was not the case about 8-10 years ago.

Food – being a gourmet, I find myself fortunate to eat as many Indian dishes as possible.

Healthcare – way cheaper compared to the US, even for a small ailment such as the common cold.

The grass is always greener on the other side. It would have been naive of me to believe only in the “American Dream”.

How and What I learnt being an Entrepreneur?

It all started with the iAccelerator program of IIM-Ahmedabad in 2010. When most of my batch-mates and peers were busy preparing for semester exams, I was at IIIT-Bangalore pitching my first startup plan.

Please find the screenshot of feedback I am sharing below.

Even though my pitch couldn’t make it that year, clearing the application round and being invited to IIIT-B was in itself a big boost for me.

This was at a time when startup was not a buzzword in India (pre-smartphone era); startups in India had just started taking their baby steps.

I am a first generation Entrepreneur.

Initially entrepreneurship was just an attraction (2009, sophomore year). I remember a delegate from the USA who was in our college as part of a student interaction program. He was explaining to a hall of 100-150 students the trend in the US where students drop out to start their own website/company (Facebook was a 5-year-old startup at this point). I announced to him and the hall, “you got it mate, I will have a website”, with lots of adrenaline rush.

Professors in my college were very supportive at this stage. I discussed a lot of my ideas with Professor Pushpalatha K N (Linear Integrated Circuits and Applications), who was always encouraging. Also, Professor Deepa N P (micro-controllers) was keen on helping me set up a web portal way back in 2009. And not to forget Professor Nirmal Kumar Benni, whom I could meet only during my final year, was supportive as well.

I just wanted to start a website to begin with; later I wanted it to be a networking platform, and on further thought I wanted it to be an educational platform. This was all on paper.

Years later, in practice, I started with a blogging platform and later upgraded to an e-commerce platform, which was then incorporated as a company.

Currently, I own 3 websites: 1 personal (mohanma.com) and 2 brands (sumshowdone.com & bookcorridor.com) under 1 company (Sumshowdone Technologies Private Limited).

I had my first job offer through campus placement in my final year of engineering (2011). The thought of starting a company was always there; I came close in 2013 but could not turn the plan into action that year. I owned my first website in 2014 (sumshowdone.com), became an online merchant in 2015, incorporated a company in 2016, owned my second website (bookcorridor.com) in 2017 and owned my third website in 2018 (mohanma.com – this was spontaneous).

The startup landscape has seen a huge shift compared to 2010; the number of users in the industry that I operate in has grown significantly.

The kind of people that I have got to meet in this field is something different from a corporate life.

Entrepreneurship gives you freedom, and with freedom comes great responsibility.

Value creation is the biggest form of reward that an entrepreneur yearns for. If you want to be an entrepreneur tomorrow, you should concentrate on a product or service that can change somebody’s life or make the buyer a happier person.

Things that I learnt being an entrepreneur:

  • If you want to do it, just do it (gut feeling).
  • Face your fears.
  • Never postpone, time is a very important resource.
  • Reading plays a very important role in shaping your thought process.
  • Failures are stepping stones to success.

The amount of knowledge and learning you acquire being an entrepreneur is tremendous.

An entrepreneur should be ready to wear all hats, from being a leader to being an office boy.

Each one of us is knowingly or unknowingly an entrepreneur at some point. I believe at the end it’s all about your temperament, your adaptability combined with serenity.

Algorithms in Java

Most algorithms provided by the Java platform operate on List instances, while a few operate on arbitrary Collection instances. The algorithms covered here are:

  • Sorting
  • Shuffling
  • Routine Data Manipulation
  • Searching
  • Composition
  • Finding Extreme Values

Sorting

The sort algorithm reorders a List so that its elements are in ascending order according to an ordering relationship. Two forms of the operation are provided.

FORM 1

The simple form takes a List and sorts it according to its elements’ natural ordering.

The sort operation uses an optimized merge sort algorithm which is fast and stable:

  • Fast: It is guaranteed to run in n log(n) time and runs substantially faster on nearly sorted lists. Empirical tests showed it to be as fast as a highly optimized quicksort. A quicksort is generally considered to be faster than a merge sort but isn’t stable and doesn’t guarantee n log(n) performance.
  • Stable: It doesn’t reorder equal elements. This is important if you sort the same list repeatedly on different attributes. If a user of a mail program sorts the inbox by mailing date and then sorts it by sender, the user naturally expects that the now-sequential list of messages from a given sender will (still) be sorted by mailing date. This is guaranteed only if the second sort was stable.

FORM 2

The second form of sort takes a Comparator in addition to a List and sorts the elements with the Comparator.
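
Both forms of sort can be tried with the small sketch below. The class name, method names and the fruit list are made up for illustration; the calls themselves are the standard Collections.sort methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class SortDemo {

    // Form 1: sort by the elements' natural ordering (alphabetical for String).
    static List<String> sortNaturally(List<String> input) {
        List<String> copy = new ArrayList<>(input); // leave the caller's list untouched
        Collections.sort(copy);
        return copy;
    }

    // Form 2: sort with an explicit Comparator, here by string length.
    static List<String> sortByLength(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        Collections.sort(copy, Comparator.comparingInt(String::length));
        return copy;
    }

    public static void main(String[] args) {
        List<String> fruits = Arrays.asList("pear", "fig", "banana", "kiwi");
        System.out.println(sortNaturally(fruits)); // [banana, fig, kiwi, pear]
        // The sort is stable, so "pear" stays ahead of "kiwi" (both have length 4).
        System.out.println(sortByLength(fruits));  // [fig, pear, kiwi, banana]
    }
}
```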

Shuffling

The shuffle algorithm does the opposite of what sort does, destroying any trace of order that may have been present in a List. That is, this algorithm reorders the List based on input from a source of randomness such that all possible permutations occur with equal likelihood, assuming a fair source of randomness. This algorithm is useful in implementing games of chance.

This operation has two forms:

One takes a List and uses a default source of randomness, and the other requires the caller to provide a Random object to use as a source of randomness.
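
A quick sketch of the two shuffle forms follows. The class and method names and the seed value are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class ShuffleDemo {

    // Form 1: shuffle with the default source of randomness.
    static List<Integer> shuffleDefault(List<Integer> input) {
        List<Integer> copy = new ArrayList<>(input);
        Collections.shuffle(copy);
        return copy;
    }

    // Form 2: shuffle with a caller-supplied Random object.
    static List<Integer> shuffleSeeded(List<Integer> input, long seed) {
        List<Integer> copy = new ArrayList<>(input);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }

    public static void main(String[] args) {
        List<Integer> cards = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(shuffleDefault(cards));     // order differs from run to run
        System.out.println(shuffleSeeded(cards, 42L)); // repeatable for a given seed on a given JDK
    }
}
```

Passing your own Random with a fixed seed is handy in tests, where you want the "random" order to be the same each time.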

Routine Data Manipulation

The Collections class provides five algorithms for doing routine data manipulation on List objects, as listed below:

  • reverse — reverses the order of the elements in a List.
  • fill — overwrites every element in a List with the specified value. This operation is useful for reinitializing a List.
  • copy — takes two arguments, a destination List and a source List, and copies the elements of the source into the destination, overwriting its contents. The destination List must be at least as long as the source. If it is longer, the remaining elements in the destination List are unaffected.
  • swap — swaps the elements at the specified positions in a List.
  • addAll — adds all the specified elements to a Collection. The elements to be added may be specified individually or as an array.
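
The five operations above can be seen in action in this small sketch. The class name, method names and sample values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ManipulationDemo {

    static List<String> reverseSwapAddAll() {
        List<String> letters = new ArrayList<>(Arrays.asList("a", "b", "c"));
        Collections.reverse(letters);          // [c, b, a]
        Collections.swap(letters, 0, 1);       // [b, c, a]
        Collections.addAll(letters, "d", "e"); // [b, c, a, d, e]
        return letters;
    }

    static List<String> copyIntoLongerList() {
        List<String> source = Arrays.asList("x", "y");
        List<String> destination = new ArrayList<>(Arrays.asList("1", "2", "3"));
        Collections.copy(destination, source); // [x, y, 3]: the extra element is unaffected
        return destination;
    }

    static List<String> fillWithDash() {
        List<String> slots = new ArrayList<>(Arrays.asList("1", "2", "3"));
        Collections.fill(slots, "-");          // [-, -, -]
        return slots;
    }

    public static void main(String[] args) {
        System.out.println(reverseSwapAddAll());
        System.out.println(copyIntoLongerList());
        System.out.println(fillWithDash());
    }
}
```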

Searching

The binarySearch algorithm searches for a specified element in a sorted List. This algorithm has two forms.

  1. The first takes a List and an element to search for (the “search key”). This form assumes that the List is sorted in ascending order according to the natural ordering of its elements.
  2. The second form takes a Comparator in addition to the List and the search key, and assumes that the List is sorted into ascending order according to the specified Comparator. The sort algorithm can be used to sort the List prior to calling binarySearch.

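The return value is the same for both forms. If the List contains the search key, its index is returned. If not, the return value is (-(insertion point) - 1), where the insertion point is the index at which the key would be inserted into the List.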
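
Here is a small sketch of both binarySearch forms, plus the common trick of using a negative result to insert a missing key at the right place. The class name, method names and sample lists are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class SearchDemo {

    // Form 1: the list must already be sorted by natural ordering.
    static int search(List<String> sortedAscending, String key) {
        return Collections.binarySearch(sortedAscending, key);
    }

    // Form 2: the list must already be sorted by the same Comparator.
    static int searchDescending(List<String> sortedDescending, String key) {
        Comparator<String> descending = Comparator.reverseOrder();
        return Collections.binarySearch(sortedDescending, key, descending);
    }

    // A negative result encodes the insertion point as (-(insertion point) - 1).
    static void insertIfAbsent(List<String> sortedAscending, String key) {
        int position = Collections.binarySearch(sortedAscending, key);
        if (position < 0) {
            sortedAscending.add(-position - 1, key);
        }
    }

    public static void main(String[] args) {
        List<String> fruits = new ArrayList<>(Arrays.asList("apple", "cherry", "mango"));
        System.out.println(search(fruits, "cherry")); // 1
        System.out.println(search(fruits, "banana")); // -2 (insertion point is 1)
        insertIfAbsent(fruits, "banana");
        System.out.println(fruits);                   // [apple, banana, cherry, mango]
        System.out.println(searchDescending(Arrays.asList("mango", "cherry", "apple"), "apple")); // 2
    }
}
```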

Composition

The frequency and disjoint algorithms test some aspect of the composition of one or more Collections:

  • frequency — counts the number of times the specified element occurs in the specified collection.
  • disjoint — determines whether two Collections are disjoint; that is, whether they contain no elements in common.
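
A short sketch of frequency and disjoint, with made-up class and method names and sample data:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

class CompositionDemo {

    // Counts how many times target occurs in items.
    static int countOf(Collection<String> items, String target) {
        return Collections.frequency(items, target);
    }

    // True when the two collections share no element.
    static boolean nothingInCommon(Collection<?> first, Collection<?> second) {
        return Collections.disjoint(first, second);
    }

    public static void main(String[] args) {
        List<String> letters = Arrays.asList("a", "b", "a", "c");
        System.out.println(countOf(letters, "a"));                                    // 2
        System.out.println(nothingInCommon(Arrays.asList(1, 2), Arrays.asList(3, 4))); // true
        System.out.println(nothingInCommon(Arrays.asList(1, 2), Arrays.asList(2, 3))); // false
    }
}
```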

Finding Extreme Values

The min and the max algorithms return, respectively, the minimum and maximum element contained in a specified Collection. Both of these operations come in two forms.

The simple form takes only a Collection and returns the minimum (or maximum) element according to the elements’ natural ordering.

The second form takes a Comparator in addition to the Collection and returns the minimum (or maximum) element according to the specified Comparator.
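
Both forms of min and max are shown in the sketch below. The class name, method names and sample values are made up for illustration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class ExtremeValuesDemo {

    // Simple form: natural ordering. An empty collection throws NoSuchElementException.
    static int smallest(Collection<Integer> values) {
        return Collections.min(values);
    }

    static int largest(Collection<Integer> values) {
        return Collections.max(values);
    }

    // Comparator form: here the ordering is by string length.
    static String shortest(Collection<String> words) {
        return Collections.min(words, Comparator.comparingInt(String::length));
    }

    static String longest(Collection<String> words) {
        return Collections.max(words, Comparator.comparingInt(String::length));
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(3, 1, 2);
        List<String> words = Arrays.asList("pear", "fig", "banana");
        System.out.println(smallest(numbers)); // 1
        System.out.println(largest(numbers));  // 3
        System.out.println(shortest(words));   // fig
        System.out.println(longest(words));    // banana
    }
}
```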

Source: The Java™ Tutorials

 

Java Collection Framework

A collection or container is an object that groups multiple elements into a single unit. Collections are used to store, retrieve, manipulate and aggregate data.

Collections Framework

A collections framework is a unified architecture for representing and manipulating collections. All collections frameworks contain the following:

  • Interfaces: Interfaces are abstract data types that represent collections. Interfaces allow collections to be manipulated independently of the details of their representation. In object-oriented languages, interfaces generally form a hierarchy.
  • Implementations: These are the concrete implementations of the collection interfaces. Implementations are reusable data structures.
  • Algorithms: Algorithms are the methods that perform useful computations, such as searching and sorting, on objects that implement collection interfaces. The algorithms are said to be polymorphic: that is, the same method can be used on many different implementations of the appropriate collection interface. Algorithms possess reusable functionality.
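
The polymorphic nature of the algorithms can be seen in a tiny sketch: one method call works unchanged on three different implementations. The class and method names are my own, chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;

class PolymorphicAlgorithmDemo {

    // Accepts any Collection implementation; the algorithm does not care which one.
    static int largest(Collection<Integer> values) {
        return Collections.max(values);
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(3, 9, 4);
        System.out.println(largest(new ArrayList<>(numbers)));  // 9
        System.out.println(largest(new LinkedList<>(numbers))); // 9
        System.out.println(largest(new HashSet<>(numbers)));    // 9
    }
}
```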

Interfaces

The  collection interfaces encapsulate different types of collections. These interfaces allow collections to be manipulated independently of the details of their representation.

  • Collection — the root of the collection hierarchy. A collection represents a group of objects known as its elements. The Collection interface is the least common denominator that all collections implement and is used to pass collections around and to manipulate them when maximum generality is desired. Some types of collections allow duplicate elements, and others do not. Some are ordered and others are unordered. The Java platform doesn’t provide any direct implementations of this interface but provides implementations of more specific subinterfaces, such as Set and List.
  • Set — a collection that cannot contain duplicate elements.
  • SortedSet — a Set that maintains its elements in ascending order. Several additional operations are provided to take advantage of the ordering. Sorted sets are used for naturally ordered sets, such as word lists and membership rolls.
  • List — an ordered collection (sometimes called a sequence). Lists can contain duplicate elements. The user of a List generally has precise control over where in the list each element is inserted and can access elements by their integer index (position).
  • Queue — a collection used to hold multiple elements prior to processing. Besides basic Collection operations, a Queue provides additional insertion, extraction, and inspection operations. Queues usually order elements in a FIFO (first-in, first-out) manner.
  • Deque — similar to a Queue, but a Deque can be used both as FIFO (first-in, first-out) and LIFO (last-in, first-out). In a deque, elements can be inserted, retrieved and removed at both ends.
  • Map — an object that maps keys to values. A Map cannot contain duplicate keys; each key can map to at most one value.
  • SortedMap — a Map that maintains its mappings in ascending key order. This is the Map analog of SortedSet. Sorted maps are used for naturally ordered collections of key/value pairs, such as dictionaries and telephone directories.
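
A few of the interfaces above can be tried out with the sketch below. TreeSet, TreeMap and ArrayDeque are standard JDK implementations; the class name, method names and sample words are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class InterfacesDemo {

    // SortedSet: duplicates are dropped and elements come out in ascending order.
    static SortedSet<String> distinctSorted(List<String> words) {
        return new TreeSet<>(words);
    }

    // SortedMap: one value per key, iterated in ascending key order.
    static SortedMap<String, Integer> countWords(List<String> words) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Deque: the same object can be read from the front (FIFO) or the back (LIFO).
    static String fifoThenLifo() {
        Deque<String> deque = new ArrayDeque<>();
        deque.addLast("first");
        deque.addLast("second");
        deque.addLast("third");
        String fifo = deque.pollFirst(); // oldest element: "first"
        String lifo = deque.pollLast();  // newest element: "third"
        return fifo + "," + lifo;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("pear", "apple", "pear");
        System.out.println(distinctSorted(words)); // [apple, pear]
        System.out.println(countWords(words));     // {apple=1, pear=2}
        System.out.println(fifoThenLifo());        // first,third
    }
}
```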

If an unsupported operation is invoked, a collection throws an UnsupportedOperationException.
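
One easy way to see this exception is with a read-only view of a list. The sketch below uses Collections.unmodifiableList; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class UnsupportedOperationDemo {

    // Returns true when the list refuses the optional add operation.
    static boolean rejectsAdd(List<String> list) {
        try {
            list.add("new element");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> modifiable = new ArrayList<>(Arrays.asList("a", "b"));
        List<String> readOnly = Collections.unmodifiableList(modifiable);
        System.out.println(rejectsAdd(readOnly));   // true
        System.out.println(rejectsAdd(modifiable)); // false
    }
}
```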

Note:  All core collection interfaces are generic.

Source: The Java™ Tutorials