Decision Support Systems 2011/2012 Week 1. Lecture 1
Outline
  Course presentation
  Decision Support Systems: An Overview
Decision Support Systems: The Course
Faculty
  Francisco Melo (fmelo@inesc-id.pt)
  Office hours: Monday, 9h30-11h00; Wednesday, 14h00-15h30 (S. Polivalente, Pav. Informática II)
  Contacting me by e-mail: use e-mail mainly for logistic issues; otherwise, use office hours.
  For news, keep an eye on the webpage: http://fenix.ist.utl.pt/disciplinas/sad/2011-2012/1-semestre/
Bibliography
  Main:
    J. Han, M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2nd edition, 2001. (There is a new edition from 2011, but I'll stick to the previous one.)
Bibliography (cont.)
  Auxiliary:
    R. Kimball, M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. Wiley Computer Publishing, 2nd edition, 2002.
    T. Mitchell. Machine Learning. McGraw Hill, 1997.
    J. Smola and B. Schölkopf. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.
Classes
  Lectures:
    Slide presentations of textbook material (slides will be available online)
    Lecture notes, when needed (will also be available online)
  Lab sessions:
    Groups of 3 (preferably)
    SQL Server 2008
    Exercises and practical tasks
    Begin on September 26th (but keep an eye on the webpage)
Grading
  Three components:
    Data warehousing project: 30% of final grade (consists of 4 homeworks that follow lab sessions)
    Data mining project: 30% of final grade (consists of 4 homeworks that follow lab sessions)
    Final examination: 40% of final grade
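Grading (cont.)
  The final grade is given by: NF = 0.3 NDW + 0.3 NDM + 0.4 NE
  (NDW, NDM and NE are the grades of the data warehousing project, the data mining project and the final examination.)
  To pass the course you must satisfy all of the conditions below:
    NDW ≥ 9.5
    NDM ≥ 9.5
    NE ≥ 9.5
  All grades are posted on the course website.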
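The grading rule can be written out as a short calculation. The sketch below takes the weights and the 9.5 threshold from the slide; reading each condition as "at least 9.5" is an assumption, and the function names are hypothetical.

```python
# Weights and threshold as given on the slide.
WEIGHTS = {"NDW": 0.3, "NDM": 0.3, "NE": 0.4}
MIN_COMPONENT = 9.5  # assumed to be a lower bound (>=) on each component


def final_grade(ndw: float, ndm: float, ne: float) -> float:
    """NF = 0.3 NDW + 0.3 NDM + 0.4 NE."""
    return WEIGHTS["NDW"] * ndw + WEIGHTS["NDM"] * ndm + WEIGHTS["NE"] * ne


def passes(ndw: float, ndm: float, ne: float) -> bool:
    """All three components must reach the minimum, not just the weighted sum."""
    return all(grade >= MIN_COMPONENT for grade in (ndw, ndm, ne))


print(final_grade(12, 14, 10), passes(12, 14, 10))
```

Note that a high final grade does not compensate for a single component below the threshold: under this reading, `passes(9.4, 20, 20)` is false.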
Grading (cont.)
  Projects/homework assignments should be completed in groups of three students.
  Project grades will be given individually upon discussion, if deemed necessary by the faculty.
  Although discussion between groups is allowed, it should always be kept in general terms. Students are not allowed to show, share or discuss specific solutions, either physically or electronically.
  Similarly, you may consult references (both on paper and online) for ideas about how to tackle specific problems. However, the solutions delivered should result from original work by the students.
Important Dates
  Data warehousing project:
    1. Issued: Sep. 25. Due: Oct. 3 (at the end of lecture)
    2. Issued: Oct. 2. Due: Oct. 10 (at the end of lecture)
    3. Issued: Oct. 9. Due: Oct. 17 (at the end of lecture)
    4. Issued: Oct. 16. Due: Oct. 24 (at the end of lecture)
Important Dates
  Data mining project:
    1. Issued: Oct. 23. Due: Oct. 31 (at the end of lecture)
    2. Issued: Nov. 6. Due: Nov. 14 (at the end of lecture)
    3. Issued: Nov. 20. Due: Nov. 28 (at the end of lecture)
    4. Issued: Dec. 4. Due: Dec. 12 (at the end of lecture)
Important Dates
  Examinations:
    Jan. 07, 2012
    Jan. 31, 2012 (resit exam, "Recurso")
Syllabus
  Introduction (Chap. 1)
  Data pre-processing (Chap. 2)
  Data warehousing (Chap. 3)
    Multidimensional data model
    Data warehouse architecture
    Online analytical processing (OLAP)
  Data cube computation (Chap. 4)
Syllabus (cont.)
  Pattern mining (Chap. 5)
    Itemset mining
    Mining association rules
  Clustering (Chap. 7)
    k-means
    Hierarchical methods
    Expectation-maximization
  Supervised learning (Chap. 6)
    Decision tree learning
    Bayesian learning
    Learning sets of rules
    Artificial neural networks
    Support vector machines
    Model selection
Decision Support Systems: What Is This All About? or DSS in 60 minutes
Databases: Storing and Accessing Data
  Database: a computerized system to maintain information and make it available on demand.
  [Diagram labels: Database Management System, Database, Application programs, End users]
Example (Rel. Database)

  Suppliers
    S#  SNAME  STATUS  CITY
    S1  Smith  20      London
    S2  Jones  10      Paris
    S3  Blake  30      Paris
    S4  Clark  20      London
    S5  Adams  30      Athens

  Parts
    P#  PNAME  COLOR  WEIGHT  CITY
    P1  Nut    Red    12      London
    P2  Bolt   Green  17      Paris
    P3  Screw  Blue   17      Rome
    P4  Screw  Red    14      London
    P5  Cam    Blue   12      Paris
    P6  Cog    Red    19      London

  Shipments
    S#  P#  QTY
    S1  P1  300
    S1  P2  200
    S1  P3  400
    S1  P4  200
    S1  P5  100
    S1  P6  100
    S2  P1  300
    S2  P2  400
    S3  P2  200
    S4  P2  200
    S4  P4  300
    S4  P5  400

  Annotations on the slide mark a primary key, the attributes (columns), a relation (table) and a tuple (row).
Example (Query)
  (Same Suppliers, Parts and Shipments relations as in the previous example.)
  Which city has the largest amount of shipping suppliers?

    SELECT PX.CITY, MAX(PX.TOTAL)
    FROM (SELECT CITY, SUM(QTY) AS TOTAL
          FROM SUPPLIERS NATURAL JOIN PARTS NATURAL JOIN SHIPMENTS
          GROUP BY CITY) AS PX;
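The query on this slide can be tried out with a small script. The sketch below is one possible reading, not the slide's exact query: it assumes the question asks for the supplier city with the largest total shipped quantity. It joins Suppliers to Shipments only, because a NATURAL JOIN with Parts would also match on the shared CITY column and drop shipments whose part is stored in a different city. It also replaces the bare `MAX` in the outer query with `ORDER BY ... LIMIT 1`, so that the returned city is well defined. The column names `sno` and `pno` stand in for S# and P#, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Rows copied from the Suppliers and Shipments relations on the slide.
SUPPLIERS = [
    ("S1", "Smith", 20, "London"), ("S2", "Jones", 10, "Paris"),
    ("S3", "Blake", 30, "Paris"), ("S4", "Clark", 20, "London"),
    ("S5", "Adams", 30, "Athens"),
]
SHIPMENTS = [
    ("S1", "P1", 300), ("S1", "P2", 200), ("S1", "P3", 400),
    ("S1", "P4", 200), ("S1", "P5", 100), ("S1", "P6", 100),
    ("S2", "P1", 300), ("S2", "P2", 400), ("S3", "P2", 200),
    ("S4", "P2", 200), ("S4", "P4", 300), ("S4", "P5", 400),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE suppliers (sno TEXT PRIMARY KEY, sname TEXT, status INTEGER, city TEXT)"
)
conn.execute("CREATE TABLE shipments (sno TEXT, pno TEXT, qty INTEGER)")
conn.executemany("INSERT INTO suppliers VALUES (?, ?, ?, ?)", SUPPLIERS)
conn.executemany("INSERT INTO shipments VALUES (?, ?, ?)", SHIPMENTS)

# Total shipped quantity per supplier city, largest first.
QUERY = """
SELECT s.city, SUM(sp.qty) AS total
FROM suppliers AS s
JOIN shipments AS sp ON sp.sno = s.sno
GROUP BY s.city
ORDER BY total DESC
LIMIT 1
"""
top_city, top_total = conn.execute(QUERY).fetchone()
print(top_city, top_total)  # London 2200
```

In SQL Server, which the labs use, `SELECT TOP 1 ...` with the same `ORDER BY` would take the place of `LIMIT 1`.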
Another (Larger) Example

  Customer
    CUSTID  NAME         ADDRESS                AGE  INCOME     CREDITCAT  CUSTCAT
    C1      Maria Silva  Av. Liberdade, n. 123  31   60,000.00  1          3

  Item
    ITEMID  NAME       BRAND    CATEGORY         TYPE      PRICE     MADEIN  SUPPLIER  COST
    I3      hi-res-tv  Mochiba  High resolution  TV        988.00    Japan   NikoX     600.00
    I8      laptop     Cell     Laptop           computer  1,369.00  USA     Cell      983.00

  Employee
    EMPID  NAME            CATEGORY            GROUP    SALARY      COMMISSION
    E55    Santos, Manuel  home entertainment  manager  100,000.00  2%

  WorksAt
    EMPID  BRANCHID
    E55    B1

  Purchases
    TRANSID  CUSTID  EMPID  DATE        TIME   PAYMENT  AMOUNT
    T100     C1      E55    21/03/2011  14:35  VISA     1,357.00

  Branch
    BRANCHID  NAME     ADDRESS
    B1        Colombo  Av. Lusíada

  ItemsSold
    TRANSID  ITEMID  QTY
    T100     I3      1

  What is the pair of items most frequently bought together per branch/time of year?
Interpreting Data
  In the presence of huge amounts of data, how can we extract information that is:
    non-trivial
    implicit
    previously unknown
    potentially useful
Interpreting Data (cont.)
  Interpretation of data can benefit from:
    Information-friendly ways to represent data: data warehousing
    Automated methods to extract information: data mining
Representing Data
Representing Data
  Examples of data representation:

    Month  Net profit
    Jan.   10,974.00
    Feb.    5,944.00
    Mar.    4,846.00
    Apr.    2,056.00
    May     2,250.00
    Jun.    3,896.00
    Jul.    3,366.00
    Aug.    4,936.00
    Sep.    4,786.00
    Oct.    3,000.00
    Nov.    3,566.00
    Dec.    2,376.00
Database Example

  Sales
    NAME   COLOR   SIZE  QTY
    skirt  dark    S       2
    skirt  dark    M       5
    skirt  dark    L       1
    skirt  pastel  S      11
    skirt  pastel  M       9
    skirt  pastel  L      15
    skirt  white   S       2
    skirt  white   M       5
    skirt  white   L       3
    dress  dark    S       2
    dress  dark    M       6
    dress  dark    L      12
    dress  pastel  S       4
    dress  pastel  M       3
    dress  pastel  L       3
    dress  white   S       2
    dress  white   M       3
    dress  white   L       0
    shirt  dark    S       2
    shirt  dark    M       6
    shirt  dark    L       6
    pants  white   L       2
Sales Database Example
  (Same Sales relation as in the previous slide.)
  Measure attributes: attributes that measure some quantity and can be aggregated upon (here, QTY).
Sales Database Example (cont.)
  Dimension attributes: attributes that define dimensions on which measure attributes can be viewed (here, NAME, COLOR and SIZE).
Database Example (cont.)
  Sales by NAME and COLOR, with SIZE = all:

    NAME   dark  pastel  white  TOTAL
    skirt     8      35     10     53
    dress    20      10      5     35
    shirt    14       7     28     49
    pants    20       2      5     27
    TOTAL    62      54     48    164

  This is a cross-tabulation. It is not a relation!
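A cross-tabulation like this one can be derived from the Sales relation by summing the measure attribute QTY over the dimension that is set to "all". The sketch below is a minimal illustration, assuming only the skirt and dress rows of Sales (the rows that are fully listed on the slides); the function name `cross_tab` is hypothetical.

```python
from collections import defaultdict

# Skirt and dress rows of the Sales relation: (NAME, COLOR, SIZE, QTY).
SALES = [
    ("skirt", "dark", "S", 2), ("skirt", "dark", "M", 5), ("skirt", "dark", "L", 1),
    ("skirt", "pastel", "S", 11), ("skirt", "pastel", "M", 9), ("skirt", "pastel", "L", 15),
    ("skirt", "white", "S", 2), ("skirt", "white", "M", 5), ("skirt", "white", "L", 3),
    ("dress", "dark", "S", 2), ("dress", "dark", "M", 6), ("dress", "dark", "L", 12),
    ("dress", "pastel", "S", 4), ("dress", "pastel", "M", 3), ("dress", "pastel", "L", 3),
    ("dress", "white", "S", 2), ("dress", "white", "M", 3), ("dress", "white", "L", 0),
]


def cross_tab(rows):
    """Sum QTY over SIZE, keeping NAME and COLOR; 'TOTAL' holds the margins."""
    table = defaultdict(int)
    for name, color, _size, qty in rows:
        table[(name, color)] += qty
        table[(name, "TOTAL")] += qty
        table[("TOTAL", color)] += qty
        table[("TOTAL", "TOTAL")] += qty
    return dict(table)


tab = cross_tab(SALES)
for name in ("skirt", "dress"):
    print(name, [tab[(name, c)] for c in ("dark", "pastel", "white", "TOTAL")])
# skirt [8, 35, 10, 53]
# dress [20, 10, 5, 35]
```

The skirt and dress rows reproduce the corresponding lines of the cross-tabulation; the column totals differ from the slide because the shirt and pants rows are left out here.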
Aggregation
  Data can be aggregated across different dimensions. Sales by NAME and SIZE, with COLOR = all:

    NAME    S   M   L  TOTAL
    skirt  15  19  19     53
    dress   8  12  15     35
    shirt  23   8  18     49
    pants  18   6   3     27
    TOTAL  64  45  55    164
Aggregation (cont.)
  Data can be aggregated at different granularity:
  [Figure: 3-dimensional cuboid (data cube) with axes NAME (skirt, dress, shirt, pants, all), COLOR (dark, pastel, white, all) and SIZE (small, medium, large, all); the all/all/all cell holds the grand total, 164.]
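A data cube stores the aggregate for every combination in which each dimension is either kept at a specific value or generalized to "all". The sketch below computes such a cube in the most direct way, assuming only the skirt and dress rows of the Sales relation; it is an illustration of the idea, not an efficient cube-computation algorithm, and `data_cube` is a hypothetical name.

```python
from collections import defaultdict
from itertools import product

# Skirt and dress rows of the Sales relation: (NAME, COLOR, SIZE, QTY).
SALES = [
    ("skirt", "dark", "S", 2), ("skirt", "dark", "M", 5), ("skirt", "dark", "L", 1),
    ("skirt", "pastel", "S", 11), ("skirt", "pastel", "M", 9), ("skirt", "pastel", "L", 15),
    ("skirt", "white", "S", 2), ("skirt", "white", "M", 5), ("skirt", "white", "L", 3),
    ("dress", "dark", "S", 2), ("dress", "dark", "M", 6), ("dress", "dark", "L", 12),
    ("dress", "pastel", "S", 4), ("dress", "pastel", "M", 3), ("dress", "pastel", "L", 3),
    ("dress", "white", "S", 2), ("dress", "white", "M", 3), ("dress", "white", "L", 0),
]


def data_cube(rows):
    """Map every (NAME, COLOR, SIZE) cell, with 'all' as a wildcard, to its total QTY."""
    cube = defaultdict(int)
    for name, color, size, qty in rows:
        # Each dimension is either kept or generalized to "all",
        # so every row contributes to 2 * 2 * 2 = 8 cells.
        for cell in product((name, "all"), (color, "all"), (size, "all")):
            cube[cell] += qty
    return dict(cube)


cube = data_cube(SALES)
print(cube[("skirt", "all", "S")])    # 15: small skirts of any color
print(cube[("all", "all", "all")])    # 88: grand total for skirts and dresses
```

Cells with one "all" correspond to the two-dimensional cross-tabulations, and the cell with three "all" entries is the grand total.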
Aggregation (cont.)
  (Same Sales relation as before.)
  We can define a hierarchy over attribute values along specific dimensions:

  NameCat
    NAME   CATEGORY
    skirt  womenswear
    dress  womenswear
    shirt  menswear
    pants  menswear

  NAME is less general (more attribute values are possible); CATEGORY is more general.
Aggregation (cont.)
  Data can be aggregated at different resolution:

    CATEGORY     S   M   L  TOTAL
    womenswear  23  31  34     88
    menswear    41  14  21     76
    TOTAL       64  45  55    164

    NAME    S   M   L  TOTAL
    skirt  15  19  19     53
    dress   8  12  15     35
    shirt  23   8  18     49
    pants  18   6   3     27
    TOTAL  64  45  55    164

  Roll-up: from the finer table (by NAME) to the coarser one (by CATEGORY).
  Drill-down: from the coarser table (by CATEGORY) back to the finer one (by NAME).
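Roll-up along the NAME to CATEGORY hierarchy amounts to summing the rows that share a category. A minimal sketch, using the NAME by SIZE totals and the NameCat hierarchy from the slides; the function name `roll_up` is hypothetical.

```python
# NAME x SIZE totals (COLOR = all), as in the finer-grained table.
BY_NAME = {
    "skirt": {"S": 15, "M": 19, "L": 19},
    "dress": {"S": 8, "M": 12, "L": 15},
    "shirt": {"S": 23, "M": 8, "L": 18},
    "pants": {"S": 18, "M": 6, "L": 3},
}

# The NameCat hierarchy: each NAME maps to one, more general CATEGORY.
NAME_TO_CATEGORY = {
    "skirt": "womenswear",
    "dress": "womenswear",
    "shirt": "menswear",
    "pants": "menswear",
}


def roll_up(by_name, hierarchy):
    """Aggregate per-NAME totals into per-CATEGORY totals."""
    rolled = {}
    for name, by_size in by_name.items():
        target = rolled.setdefault(hierarchy[name], {})
        for size, qty in by_size.items():
            target[size] = target.get(size, 0) + qty
    return rolled


by_category = roll_up(BY_NAME, NAME_TO_CATEGORY)
print(by_category["womenswear"])  # {'S': 23, 'M': 31, 'L': 34}
print(by_category["menswear"])    # {'S': 41, 'M': 14, 'L': 21}
```

The reverse operation cannot be computed from the rolled-up table alone: drilling down requires going back to the finer-grained data.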
Data Warehousing
  Provides architectures and tools to:
    Organize
    Understand
    Use
  data, assisting in strategic decision-making
Analyzing Data
Analyzing Data
  (Same Branch, Customer, Item, Employee, WorksAt, Purchases and ItemsSold relations as in "Another (Larger) Example".)
  Is there any relation between customers' incomes and the average price of laptops they purchase?
Analyzing Data (cont.)
  Statistics is our friend!
Analyzing Data (cont.)
  Analyzing data beyond the obvious is hard.
  Some challenges:
    Incomplete data
    Noisy data
    Inconsistent data
Example:
  [Plot with annotations "Outlier" and "Missing value?"]
  What about noise?
Example: Noisier data (trend less clear)
Example:
Extracting Useful Information
  What if you want to predict your missing value?
  [Plot with annotation "Missing value?"]
Extracting Useful Information (cont.)
  We can use the observed data to build a model and use it to predict:
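As one concrete illustration of building a model from observed data, the sketch below fits a straight line by ordinary least squares and uses it to fill in a held-out value. The linear model, the reuse of the monthly net-profit figures from the data-representation slide, and the choice of June as the "missing" month are all assumptions made for illustration; the slides do not say which model or data their plots use.

```python
# Monthly net profit (month number -> value), from the data-representation slide.
PROFIT = {
    1: 10974.0, 2: 5944.0, 3: 4846.0, 4: 2056.0, 5: 2250.0, 6: 3896.0,
    7: 3366.0, 8: 4936.0, 9: 4786.0, 10: 3000.0, 11: 3566.0, 12: 2376.0,
}
MISSING_MONTH = 6  # hypothetical: pretend this observation is unavailable


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Fit on the observed months only, then predict the held-out one.
observed = [(m, p) for m, p in PROFIT.items() if m != MISSING_MONTH]
slope, intercept = fit_line(observed)
predicted = intercept + slope * MISSING_MONTH
print(f"predicted {predicted:.2f}, actual {PROFIT[MISSING_MONTH]:.2f}")
```

A straight line is only one possible modelling assumption. The unusually large January figure has a strong influence on the fitted slope, so a different model, or a different treatment of that point, would give a different prediction.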
Machine Learning
  Machine learning/data mining: the discipline that devises methods to extract useful information (relations) from data.
  The more data you have, the more you can learn.
But...
  Accounting for the noise in the model can really make a difference!
If You Had to Predict...
  Which of the two predictions would you prefer? Why?
There's No Free Lunch!
  You always assume something about the data: the learning bias.
Data Mining
  Analyze data to extract useful (implicit) information:
    Frequent patterns [HK, Chap. 5]
    Clusters [HK, Chap. 7]
    Relations (functions) [HK, Chap. 6]
  [HK] J. Han, M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2nd edition, 2001.
That's It