Monday, March 19, 2012

Last non nulls from a set of rows

I'm in a situation where I can't alter how data is being collected, and it's
an extremely large set of information.
The table all of this data is being dumped into is a transaction style table
with heavy null propogation. The only "real data" out of the table is the
last not null value entered (defined by the record time). For example:
Record 1. 1/3/2006 3:00PM, ID2006, Null, Null, 1006, ABC.
Record 2. 1/3/2006 5:00 PM ID2006, 32, Null, Null, Null
Record 3. 1/3/2006 9:00 PM ID2006, Null, Null, 1007, Null
The result set I would want from the above data is:
ID2006, 32, Null, 1007, ABC
Or in English, The values for ID2006 that are not null in order of most
recent record time.
As an aside, I also can't depend on records being inserted sequentially by
date.
I know I can do this with a series of selects using the TOP keyword, but is
there a more elegant way to deal with this problem? The individual selects
are pretty expensive over such a large amount of data.
Many Thanks!Hi Justin Weinberg,
I don't know how useful this will be but, I thought it might be fun to
play around with it. Go ahead and copy and paste the script below. I
added a second 'ID' record to be sure it was rolling the data up and
grouping correctly.
SET nocount ON
SET ansi_warnings off /* we'll be eliminating NULL values with an
aggregate function. don't need to be reminded */
CREATE TABLE #tbl (
entrydate datetime NOT NULL,
rowid CHAR(6) NOT NULL,
f1 INT NULL,
f15 INT NULL,
f2 INT NULL,
f3 CHAR(4) NULL)
INSERT #tbl
VALUES('1/3/2006 3:00:00.000 PM',
'ID2006',
NULL,
NULL,
1006,
'ABC.')
INSERT #tbl
VALUES('1/3/2006 5:00:00.000 PM',
'ID2006',
32,
NULL,
NULL,
NULL)
INSERT #tbl
VALUES('1/3/2006 9:00:00.000 PM',
'ID2006',
NULL,
NULL,
1007,
NULL)
INSERT #tbl
VALUES('1/3/2006 10:00:00.000 PM',
'ID2008',
NULL,
NULL,
1006,
NULL)
INSERT #tbl
VALUES('1/3/2006 6:00:00.000 PM',
'ID2008',
NULL,
NULL,
NULL,
'ACO.')
INSERT #tbl
VALUES('1/3/2006 11:00:00.000 PM',
'ID2008',
1734,
NULL,
NULL,
NULL)
INSERT #tbl
VALUES('1/3/2006 11:00:00.000 PM',
'ID2008',
NULL,
12,
NULL,
NULL)
SELECT #tbl.rowid,
Max(#tbl.f1) AS maxf1,
Max(#tbl.f15) AS maxf15,
Max(#tbl.f2) AS maxf2,
Max(#tbl.f3) AS maxf3
FROM #tbl
LEFT JOIN (SELECT Max(entrydate) AS ed,
rowid
FROM #tbl
WHERE f1 IS NOT NULL
GROUP BY rowid) AS f1results
ON #tbl.rowid = f1results.rowid
AND #tbl.entrydate = f1results.ed
LEFT JOIN (SELECT Max(entrydate) AS ed,
rowid
FROM #tbl
WHERE f15 IS NOT NULL
GROUP BY rowid) AS f15results
ON #tbl.rowid = f15results.rowid
AND #tbl.entrydate = f15results.ed
LEFT JOIN (SELECT Max(entrydate) AS ed,
rowid
FROM #tbl
WHERE f2 IS NOT NULL
GROUP BY rowid) AS f2results
ON #tbl.rowid = f2results.rowid
AND #tbl.entrydate = f2results.ed
LEFT JOIN (SELECT Max(entrydate) AS ed,
rowid
FROM #tbl
WHERE f3 IS NOT NULL
GROUP BY rowid) AS f3results
ON #tbl.rowid = f3results.rowid
AND #tbl.entrydate = f3results.ed
GROUP BY #tbl.rowid
ORDER BY #tbl.rowid
DROP TABLE #tbl|||Justin,
Here is a solution likely to be faster:
create table T (
d datetime,
id char(6),
a int,
b int,
c int,
x char(3)
)
go
insert into T values('20060103 15:00:00','ID2006',null,null,1006,'ABC')
insert into T values('20060103 17:00:00','ID2006',32,null,null,null)
insert into T values('20060103 21:00:00','ID2006',null,null,1007,NULL)
go
select
id,
ltrim(substring(max(convert(char(17),d,1
12)+str(a)),18,12)) as a,
ltrim(substring(max(convert(char(17),d,1
12)+str(b)),18,12)) as b,
ltrim(substring(max(convert(char(17),d,1
12)+str(c)),18,12)) as c,
ltrim(substring(max(convert(char(17),d,1
12)+x),18,3)) as x
from T
group by id
go
It will be even faster if you can afford to make the following
changes:
create table T (
d datetime,
id char(6),
a int,
b int,
c int,
x char(3),
n as case when a+b+c+case when x is not null then 0 end is null then 1 end
)
go
create clustered index T_ci on T(id,n)
<same query>
Steve Kass
Drew University
Justin Weinberg wrote:

>I'm in a situation where I can't alter how data is being collected, and it'
s
>an extremely large set of information.
>The table all of this data is being dumped into is a transaction style tabl
e
>with heavy null propogation. The only "real data" out of the table is the
>last not null value entered (defined by the record time). For example:
>Record 1. 1/3/2006 3:00PM, ID2006, Null, Null, 1006, ABC.
>Record 2. 1/3/2006 5:00 PM ID2006, 32, Null, Null, Null
>Record 3. 1/3/2006 9:00 PM ID2006, Null, Null, 1007, Null
>The result set I would want from the above data is:
>ID2006, 32, Null, 1007, ABC
>Or in English, The values for ID2006 that are not null in order of most
>recent record time.
>As an aside, I also can't depend on records being inserted sequentially by
>date.
>I know I can do this with a series of selects using the TOP keyword, but is
>there a more elegant way to deal with this problem? The individual selects
>are pretty expensive over such a large amount of data.
>Many Thanks!
>
>
>|||Shorter and easier to maintain, to boot!
Now I know why nobody wants to work on my queries :-)

No comments:

Post a Comment