Wednesday, June 10, 2009

Introduction to Data Mining

What is Data Mining?

Today, data is abundant all around us. With drastic improvements in performance and falling hardware costs, computers have become ubiquitous. The number of people using the internet keeps growing rapidly. As a consequence, data is accumulating at an almost unimaginable rate. This massive volume makes the traditional methods of analyzing data almost worthless. Traditional methods usually involve analysing the data record by record, which becomes impractically slow at this scale even with today's modern computers. So what is the use of storing this astronomical amount of data if we cannot find any useful information in it?



Here comes our savior to address this problem: “Data Mining”. Data mining is the process of extracting useful information from this massive data by applying intelligent algorithms and machine learning techniques.

Oh! That sounds cool. Can you tell me more about data mining? What are some cases where data mining techniques can be applied?

I will be glad to provide more information. Have you ever visited the Amazon site? What do you see at the bottom of the page when you are looking at the details of a particular book? Ah yes! You got it right: the site recommends other books which you may like, and it also shows a list of books that people who bought the book you are looking at also bought. Have you ever wondered how the site is able to make such recommendations for every book that you choose? It's very simple: the recommendation system is built on data mining techniques, which analyze customer behavioral data and make recommendations based on the information obtained from that analysis.
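To make the "people who bought this also bought" idea concrete, here is a minimal sketch in Python. The order data and the function name `also_bought` are made up for illustration, and a real site's recommendation system is certainly far more sophisticated than counting co-purchases.

```python
from collections import Counter

# Hypothetical purchase history: each set is one customer's order.
ORDERS = [
    {"Data Mining", "Machine Learning", "Statistics"},
    {"Data Mining", "Machine Learning"},
    {"Data Mining", "Databases"},
    {"Cooking Basics", "Statistics"},
]

def also_bought(book, orders, top_n=2):
    """Return the titles most often bought together with `book`."""
    counts = Counter()
    for order in orders:
        if book in order:
            # Count every other title that appears in the same order.
            counts.update(order - {book})
    return [title for title, _ in counts.most_common(top_n)]

print(also_bought("Data Mining", ORDERS, top_n=1))
```

With this toy data, "Machine Learning" shows up in two of the three orders that contain "Data Mining", so it is the top suggestion.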

A bank loan officer would like to analyze whether a new loan application is safe or risky. A marketing executive would like to check whether a newly launched product could be sold to a particular customer. A sales executive would like to find out which of his ongoing deal discussions will get signed in the coming week. A medical researcher would be eager to know, given a DNA sample, whether it belongs to a cancer-affected person. All these types of problems boil down to classifying things. “Classification” is a branch of data mining which deals with these kinds of problems and provides various methods to address typical classification problems.
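As a small illustration of the loan officer's problem, here is a sketch of one simple classification method, k-nearest neighbours, in Python. The training records, the two features (income and existing debt, in thousands) and the labels are all invented for this example; it is one possible approach under those assumptions, not the way any particular bank does it.

```python
import math
from collections import Counter

# Hypothetical past applications: (annual income, existing debt) -> outcome.
TRAINING = [
    ((85, 10), "safe"),
    ((92, 5), "safe"),
    ((70, 8), "safe"),
    ((30, 40), "risky"),
    ((25, 35), "risky"),
    ((40, 50), "risky"),
]

def classify(applicant, training=TRAINING, k=3):
    """Label an applicant by majority vote among the k closest past records."""
    nearest = sorted(training, key=lambda rec: math.dist(rec[0], applicant))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(classify((80, 12)))  # close to the high-income, low-debt records
print(classify((28, 45)))  # close to the low-income, high-debt records
```

The new application simply takes the label that most of its nearest neighbours carry, so the first applicant comes out as "safe" and the second as "risky".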

A marketing executive would like to know which products are usually bought together. An earthquake research scientist would like to find out what types of geographical disturbances usually occur together during an earthquake. A car manufacturer would like to know which spare parts of a car get damaged together. This type of analysis is called “Association analysis” and is also referred to as market-basket analysis or frequent itemset mining.
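Here is a rough sketch of the market-basket idea in Python: count how often each pair of items appears in the same basket and keep the pairs whose support (the fraction of baskets containing the pair) reaches a threshold. The baskets and the 0.5 threshold are assumptions made for illustration. Real frequent itemset algorithms such as Apriori handle larger itemsets and prune the search, which this brute-force version does not attempt.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
BASKETS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def frequent_pairs(baskets, min_support=0.5):
    """Return {pair: support} for item pairs found in enough baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: n / total for pair, n in counts.items() if n / total >= min_support}

print(frequent_pairs(BASKETS))
```

In this toy data only bread and butter are bought together often enough (3 of 5 baskets) to pass the threshold.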

A bank may like to partition its customer base based on factors such as income and the balance customers maintain. A marketing executive may like to partition the customers based on the products they bought, so that he can target the marketing of a newly launched product only towards the particular cluster of customers who are likely to buy that product. This approach is commonly known as targeted marketing, and it helps to save a lot of marketing expense and time. “Cluster Analysis” is the section which deals with these kinds of problems.
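The bank example can be sketched with k-means, one well-known clustering method. The customer figures (income and balance, in thousands) and the choice of starting centroids below are assumptions for illustration only; in practice the result depends on the initial centroids and on how the features are scaled.

```python
import math

# Hypothetical customers: (annual income, average balance).
CUSTOMERS = [(20, 1), (25, 2), (22, 1.5), (90, 40), (95, 45), (100, 50)]

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of its cluster; keep the old one if the cluster is empty.
        centroids = [
            tuple(sum(vals) / len(cluster) for vals in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Start from the first and last customer as initial centroids (an arbitrary choice).
centroids, clusters = kmeans(CUSTOMERS, [CUSTOMERS[0], CUSTOMERS[-1]])
print(clusters)
```

With this data the customers separate into a low-income, low-balance group and a high-income, high-balance group, which is the kind of partition a targeted campaign could use.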

Have you ever come across this situation? You don't use your credit card for a long time, say for months together, and suddenly you make a purchase for the largest amount you have ever spent. You immediately get a call from your credit card customer care executive, who checks with you that the recent purchase is valid, that you are aware of it and that your card is safe. Why does this happen? There is an anomaly in your credit card usage pattern and there is a system in place to detect it. A nuclear power station administrator would be interested in knowing about any anomaly in normal operation as soon as it arises. A network security professional would like to detect any network intrusion as soon as it occurs. The section of data mining which deals with detecting anomalies in large data sets is known as “Anomaly detection” or “Outlier Analysis”.
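A very simple way to picture the credit card check is a z-score test: flag a purchase when it lies many standard deviations away from the cardholder's usual spending. The purchase history, the threshold of 3 and the function name below are assumptions for illustration; real fraud detection systems use far richer signals than the amount alone.

```python
from statistics import mean, stdev

# Hypothetical past purchase amounts for one cardholder.
HISTORY = [40, 55, 38, 60, 45, 52]

def is_anomalous(history, amount, threshold=3.0):
    """Flag `amount` if it is more than `threshold` standard deviations from the mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past purchases are identical, so any different amount stands out.
        return amount != mu
    return abs(amount - mu) / sigma > threshold

print(is_anomalous(HISTORY, 900))  # far above the usual spending
print(is_anomalous(HISTORY, 58))   # within the usual range
```

The 900 purchase is flagged because it sits far outside the usual range, while 58 is not.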

The areas mentioned here are only a few examples of where data mining techniques are applied, and data mining, per se, is not restricted to these sections. There are more areas under active research, such as evolutionary analysis, temporal data analysis, web mining, text mining, etc.

And that's it for the introduction. Watch this space, there is more to come. I am planning to write articles on techniques specific to each section listed above. I would be glad to receive your comments, and questions are always welcome!

12 comments:

  1. hey that's a good rudimentary post for beginners who are eager to know about data mining... Good job...

  2. Good job Venki. Interesting, waiting for the next article.

    -Jawad

  3. Good Job machi.. Keep it up...

  4. Good to know about this. Venky, can you elaborate on what Teradata is? I came across this term recently. Is it related to data storage?

  5. @Kalyan,
    Teradata is a database like Oracle RDBMS. It's mostly used in data warehousing environments.
    You can find more information at
    www.teradata.com

  6. Hi da Venki..Good to know abt these facts.. I will keep following it up..keep posting..

  7. Hi,
    Nice article

    Cheers
    Ramana

  8. I am impressed the way u started...
    really a very nice start keep going ...:)

    Ravi.

  9. Do keep posting such topics! You explained it so smoothly that even a 7th standard student can understand. Keep going!

  10. Completely agree. Data mining can be defined as the process of harvesting and discovering useful and valuable information through the analysis of enormous amounts of data found in databases, websites or data warehouses, using a number of techniques such as artificial intelligence, statistics and machine learning. It is a relatively new and promising technology.

    Introduction to Data Mining Processes
