26
CHAPTER
2.
SORTING
2.10.
MERGING SORTED
ARRAYS
27
ployees
that
showed
every
shgle
kospital visit
they
had.The
goal
was
to
help
the researchers. The state
spe
丑
t
time
removing
identifiers such
as
name
, addressy
md
social security
IIUmber-TM
Governor of
MaSE
sachlmtts
assured
the
public
that
this
was
suffideI1t
to
pmtect
patmt
privacy-TheI1a
graduate
studeI1tr LataI1ya sweeIIey>
saw
significmt pita
falls h
this
approach.She
requested a
copy
of
the
data
aRd
by
COIlathg
the
data
hmultiple
ColumRSrshe
was
able
to
idmtify
the
health
records
of
the
GoverI1or.This
demonstrated
that
extreme care I1eeds
to
be
takerl
OIIymizing
data.One
way
of
msuriIIg
privacy
is
to
aggregate
data
such
that
any
record
cm
be
mapped
to
at
least k iI1dividualSF for some
large
value
of
k.
Problem
2.7:Suppose
you
are
giveIIa
matrix
My
where
each
row
rep-
resents m iI1dividual
md
each
Colum
represeI1ts m attribute
about
the
hdividual
such
as age
or
geI1der.GiveI1a set of
ColumI1s
to
be
deletedy
vouwmt
to
determhe
if each
row
has
at
least k duplicate rows
with
ex
缸
tly
the same contents
in
the
remaini
吨
C
仙
mns.
How
would
you
verify this efficiently?
2.8
VARIABLE
LENGTH
SORT
Most sorting algorithms
r
句
0
口
a
basic
swap
问.
Wh
en
records are of
different lengths
, the
swap
step becomes
nontrivia
l.
Problem
2.8: Sort lines of a text file
that
has
a million lines such
that
the average
length
of a line is 100 characters
but
the
longest line is one
million characters long.
2.9
UNIQUE
ELEMENTS
suppose
you
are giveI1a set of
mmes
md
your
job is
to
produce
a set of
UI1iqm first
names.If
you
just remove
the
last Ilame from all
the
na
you
may
have
some duplicate first names.
Problem
2.9:
How
would
you
create a set of first
names
that
has
each
name
occurring
∞
lyonce?
Specifically, design
an
efficient algorithm for
removing all the duplicates from
an
array.
岛
fax-heap
An
other data-structure
that
is useful
in
diverse
co
口
texts
is the max-heap,
sometimes also referred to as the priority queue. (There is
no
relationship
between
the
heap
data-structure
and
the
portio
口
of
memory
in
a process
bythe
samemme.)Aheapis
akiMofabimrytree-itsupports
O(logn)
iI1serts
md
COI1stmt
time lookup for the
max
element.(The
mbheap
is
a completely symmetric version of the data-structure
and
supports
con-
stant time lookups for the
min
elemen
t.)
Searching for arbitrary keys
has
O(η)
time
complexity-a
町
thi
吨
that
can
be
done
with
a
heap
can
be
done
with
a balanced
BST
with
the same complexity
but
with
possibly
some space
and
time overhead.
2.10
MERGING
SORTED
ARRAYS
You are given 500 files, each containing stock quote information for
an
SP500 company. Each line contains
an
update
of the following form:
1232111 131
B 1000 270
2212313 246 S 100
111.01
The first
number
is the
update
time expressed as the
number
of millisec-
onds since the start of the
day's
trading. Each file individually is sorted
by
this value. Your
task
is to create a single file containing all the
up-
dates sorted
by
the
update
time. The
individual
files are of the
order
of
1-100 megabytes; the combined file will
be
of
the
order of 5 gigabytes.
Problem
2.10: Design
an
algorithm
that
takes the files as described
above
and
writes a single file containing the lines
appearing
in
the in-
dividual files sorted
by
the
update
time. The algorithm
should
use
very
little
memory
, ideally of
the
order
of a few kilobytes.
2.11
ApPROXIMATE
SORT
Co
日
sider
a situation
where
your
data
is almost
sorted
一
-for
ex
缸口
pIe
,
you
are receiving time-stamped stock quotes
and
earlier quotes
may
arrive af-
terlater quotes because of differences
in
serverloads
and
network
routes.
Wh
at
would
be
the
most
efficient
way
of restoring
the
total order?
Problem
2.11: There is a
very
long
stream of integers arriving as
an
in-
put
such
that
each integer is
at
most
one
thousand
positions
away
from
its correctly sorted position. Design
an
algorithm
that
outputs
the in-
tegers
in
the correct
order
and
uses only a constant
amount
of storage,
i.
e.
, the
memory
used
should
be
independent
of the
number
of integers
processed.
2.12
RUNNING
AVERAGES
Suppose
you
are given a real-valued time series (e.g.,
temperature
mea-
sured
by
a sensor)
with
some
noise
added
to i
t.
In
order
to
extract
meaningful trends from
noisy
time
series
data
, it is necessary to
perform
smoothing.
If
the noise
has
a Gaussian distribution
and
the noise
added
to successive samples is
independent
and
identically distributed,
then
If you find the book helpful, please purchase a copy to support the authors!