(rudra@wisc.edu) b. Shantanu - GitHub

Viewer
Transcript

1. The   names   and   emails   of   members   in   your   group a. Sneha   Rudra   ([email protected]) b. Shantanu   Singhal   ([email protected]) c. Zirui   Tao   ([email protected]) 2. List   the   schema   of   the   two   tables This   was   the   original   schema   of   the   two   tables   TableA

TableB

SNo,   Album,   SongName,   Artist,   Time, Price,   Genres,   Released,   Rights

Id,Album,Genres,ASIN,Label,Time,Rating ,TrackName,Price,Artist,Released

The   scheme   of   the   two   tables   was   transformed   to   be   as   described   in   the   next   step.   The updated   CSV   files   under   Project   Stage   1  on   the   Project   Website. 3. List   the   attributes   in   the   set   S S   =  {‘Id’,   ‘Album’,   ‘Genre’,   ‘Label’,   ‘Time’,   ‘Track’,   ‘Price’,’Artist’,   ‘Released’} The   common   scheme   has   9  attributes. 4. For   each   attribute   X  in   S,   consider   only   Table   A  and   discuss   the   following   in   the report a. The   missing   values   as   percentage   and   fractions,   classifications   of   each attribute   and   the   min,   max   and   average   length   of   the   various   textual attributes   is   discussed   in   the   table   below

Type

Missing   Values Fraction

Length

Percentage

Avg

Min

Max

Id

numeric

0

Album

textual

0

31.86

1

90

Genres

categorical

0

Label

textual

1.79

31.26

13

141

54   /  3019

Time

Time

0

Track

textual

0

26.59

1

104

Price

money

34   /  3019

1.13

Artist

textual

0

20.15

4

85

Released

Date

1   /  3019

  0.03

b. Proposed   solutions   for   missing   values :  matrix   completion   and/or   replacing with   flag   values   or   average   values. c. Show   at   least   two   histograms   that   your   team   has   created.   Find   and   report possible   outliers   and   anomalies   among   the   attribute   values We   picked   two   attributes   to   plot   the   histogram   on   our   table:   the   “Time”   attributes and    “Album”   attributes.   Since   Album   belongs   to   the   textual   type.   We   plot   its corresponding   length   distribution   on   histogram.

Based   on   the   histograms   that   we   created   for   track   time   and   album   name   we’ve identified   the   following   outliers   for   each Track   Duration   Length 1. 00:34:29   (id   =  2786)

Album   Name   Length 1.   9 0   (id   =  2980)

2.     00:34:25   (id   =  2911)

d. If   the   attribute   value   is   supposed   to   follow   a  certain   format   (e.g.,   dates), then   discuss   if   all   values   follow   the   same   format,   or   if   there   is   some problem   with   the   format   and   we   will   have   to   standardize   the   formats   later

Expected   Format

Genres

Comma   separated   list   of   Genre

Time

mm:ss

Price

$[09]+(\.[09]{1,2})

Released

The   values   in   Released   attribute should   follow   “ddMMMyy”    but   in some   cases   only   the   “yy”   mentioned. Possible   Solutions 1. We   gather   more   information and   update   the   tuple   in   the database 2. Split   the   “released   ”  attributes into   “  released   date”   ,”released month”,   “released   year”   and make   the   first   two   optional.

e. Are   there   synonyms   among   attribute   values? No,   not   in   our   datasets. f. Sometimes   attribute   values   are   "sprinkled"   all   over   the   item.   Do   you   have this   problem   with   this   attribute?   The  A rtist   attribute   is   sprinkled   in   other   attribute   such   as  T rack   name.   For example,   in   the   track   ‘Weight   Of   The   World   [feat.   Blaque   Keyz]’   by   Jon   Bellion, the   name   of   another   featuring   artist   Blaque   Keyz   appears   but   isn’t   captured   as   a distinguished   attribute. g. Do   you   see   any   other   data   quality   problems   with   this   attribute?   Yes,   the   data   we   extracted   (in   Stage   1)   was   not   encoded   in   UTF8   which   lead   to IO   errors   while   reading   this   data   into   the   database   for   analysis.   So   we   had   to convert   the   file   into   UTF8   encoding   before   feeding   it   to   the   database. 5. List   any   software   tools   that   you   have   used   to   do   the   above   data   understanding and   cleaning. List   of   packages   used:   Pandas,   Jupyter   notebook,   matplotlib,   Numpy,   Scipy

([email protected]) b. Shantanu - GitHub

The common scheme has 9 attributes. 4. For each attribute X in S, ... and cleaning. List of packages used: Pandas, Jupyter notebook, matplotlib, Numpy, Scipy.

Download PDF

159KB Sizes 15 Downloads 262 Views

Report

David B. Maia - GitHub

Alarm clock - model B - GitHub

Display (Model B) - Base PCB - GitHub

B.IO.BepeTeuoa, M.l1.rypeauq, l1.A.EMeJIHH, B. - GitHub

Display (Model B) - Dots PCB - GitHub

Olga B. Botvinnik â Curriculum Vitae - GitHub

Variant A Variant B Variant C Variant D - GitHub

BeagleBoard-xM LCD7 Rev B System Reference Manual - GitHub

pdf-82\linux-shell-scripting-cookbook-second-edition-by-shantanu ...

B = B , A B , A \ B = AA, B U (A [ B) C = AC \ BC (A \ B) C ...

brÅ¡ b# b#ayéh

B

B b ki

brÅ¡ b# b#ayéh

Alice Bob g, p, AA = ga mod p B = gb mod p BK = A b mod p K ... - GitHub

GitHub

GitHub

Form B-08-004-B -

ENDF/B-VI Ni-61 Resonance Cross Sections Energy (eV) - GitHub

The equation for a straight line is y = mx + b - GitHub

The equation for a straight line is y = mx + b (1) - GitHub

bb b

MP-STMD 06-03 DQ420MA MP-STMD 06-03-B DQ420MA - GitHub

II B

([email protected]) b. Shantanu - GitHub

David B. Maia - GitHub

Alarm clock - model B - GitHub

Display (Model B) - Base PCB - GitHub

B.IO.BepeTeuoa, M.l1.rypeauq, l1.A.EMeJIHH, B. - GitHub

Display (Model B) - Dots PCB - GitHub

Olga B. Botvinnik â Curriculum Vitae - GitHub

Variant A Variant B Variant C Variant D - GitHub

BeagleBoard-xM LCD7 Rev B System Reference Manual - GitHub

pdf-82\linux-shell-scripting-cookbook-second-edition-by-shantanu ...

B = B , A B , A \ B = AA, B U (A [ B) C = AC \ BC (A \ B) C ...

brÅ¡ b# b#ayéh

B

B b ki

brÅ¡ b# b#ayéh

Alice Bob g, p, AA = ga mod p B = gb mod p BK = A b mod p K ... - GitHub

GitHub

GitHub

Form B-08-004-B -

ENDF/B-VI Ni-61 Resonance Cross Sections Energy (eV) - GitHub

The equation for a straight line is y = mx + b - GitHub

The equation for a straight line is y = mx + b (1) - GitHub

bb b

MP-STMD 06-03 DQ420MA MP-STMD 06-03-B DQ420MA - GitHub

II B

([email protected]) b. Shantanu - GitHub

Recommend Documents

Olga B. Botvinnik â Curriculum Vitae - GitHub

brÅ¡ b# b#ayéh

brÅ¡ b# b#ayéh