All headers in scripts must use the standardised format. Examples of populated headers are shown below.
# ================================================================================================ #
# Description: produce plots that compare the K10 psychological distress
# measure to the SF-36 perceived disability measure
#
# Input: [IDI_Sandpit].[DL-MAA2016-15].[k10_vs_mh_band]
#
# Output: comparison_plot = large gg that produces a density plot
# of those with need and those with little or no need
# by average SF36 MH score
#
# Author: E Walsh
#
# Dependencies: k10_sense_check.sas builds the input table that is queried
# this can be run through main_sofie.sas
#
# SQL query located in sql/k10_vs_sf36_mh_check.sql
#
# Notes:
#
# Issues:
#
# History (reverse order):
# 24 Feb 2017 EW v1
# ================================================================================================ #
/*********************************************************************************************************
Description: Produce descriptive stats around those with an MHA need using the K10 score as
a proxy for need
Input: sand.mha_pop_sofie_w7_wgt_adj
Output:
sf_sof_n_XXX = survey freq need cross tabulation output for variable XXX
sf_sof_n_XXX_rnd = rounded cross tabulation output for variable XXX
Author: E Walsh
Dependencies: see main_sofie.sas and run the libnames macros and formats scripts
Notes: The sign out rules for SoFIE differ to the standard RR3 rounding. See pp.27-28 of the
SoFIE User Guide.
Issues:
History (reverse order):
26 Jan 2017 EW v1
*********************************************************************************************************/
Comments should explain why you are doing things a particular way. If the code or concept is complex, a few words explaining the concept or business rule are also required.
This is an example of a comment on what a piece of code is intended to do. Comments in R code must use the following notation.
# Load aggregated treatment trends available in IDI Sandpit using the sql query file
This is an example of a comment explaining the concept behind the code. In SAS, /* */ block comments must be used rather than just a single-line * comment, for ease of troubleshooting.
/* Citalopram is used to treat both Mood-Anxiety and Dementia. Hence we can
definitively determine the diagnosis to be Mood anxiety only if there are
no other recorded cases of dementia for the individual.*/
When writing comments in SQL, do not use comments that start with a double hyphen (--). Always use /* */ for comments in SQL scripts. This is because SAS wrappers are sometimes used to execute SQL scripts, and a double-hyphen comment may cause the SAS wrapper code to fail.
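For example, the query below is annotated with block comments only; it reuses the address notification table referenced later in this document and is illustrative rather than taken from a project.
/* count notifications per region; block comments are used so the script
   can also be run safely through a SAS wrapper */
select ant_region_code
      ,count(*) as notification_count
from [IDI_Clean].[data].[address_notification]
group by ant_region_code;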
Temporary tables should be written to the SAS work library (for example work.temp) and identified with the _temp_ notation so they are easy to find and delete. The main script should be called main and reside in sasprogs if it is a SAS script or rprogs if it is an R script (prefixes and suffixes are permitted as long as it is clear this is the main script).
/* relative path */
%include "&si_source_path.\sasprogs\si_control.sas";
# One line per pipe operator
domains <- as.data.frame(sofiecostdata %>%
                           dplyr::select(subject_area) %>%
                           group_by(subject_area) %>%
                           summarise())
/* one line per clause */
proc sql noprint;
select engine, sysvalue
into: db_engine separated by ''
, :db_schema separated by ''
from dictionary.libnames
where libname = "&db_lib";
quit;
Most projects use the following file structure (an illustrative sketch is shown after this paragraph). The folder structure must be decided and implemented before a project is started.
Note that some projects may not have R scripts; if you find particular folders are not needed then delete them.
If you choose to completely deviate from this structure you need a good reason why. For example, the SIAL project uses bqa_complete and bqa_incomplete folders to make it clear which scripts have had a business QA completed.
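The sketch below is an assumption about what such a layout could look like, built only from the folder names that appear elsewhere in this guide; the exact set of folders will vary by project.
project_name/
    sasprogs/     SAS scripts, including the main script main.sas
    sasautos/     SAS macros loaded via the sasautos option
    rprogs/       R scripts
    sql/          SQL queries called from SAS or R
    templates/    standardised output templates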
Do not use EG projects to manage your scripts.
Note that this section will require updating once the IDI has GitLab installed.
git tag -a v1.1.0 -m "v1.1.0"
A badge can then be added to the README of the repository (beneath the title) by including the following:
[![GitHub version](https://badge.fury.io/gh/nz-social-investment-unit%2F<<repository_name_here>>.svg)](https://badge.fury.io/gh/nz-social-investment-unit%2F<<repository_name_here>>)
where you enter the repository name in the <<repository_name_here>>
section. For example:
[![GitHub version](https://badge.fury.io/gh/nz-social-investment-unit%2Fsocial_investment_data_foundation.svg)](https://badge.fury.io/gh/nz-social-investment-unit%2Fsocial_investment_data_foundation)
Refer to our version control document to find out more about passing code through the firewall and how to clone repos to your PC.
/* Load all the macros automatically */
options obs=MAX mvarsize=max pagesize=132
append=(sasautos=("&use_case_path\sasautos"));
Run query plans so that queries can be optimised: hit Ctrl + L when you are in the script window, find the steps with large estimated cost percentages, and then see whether the query can be rewritten in a better way.
Use explicit pass-throughs where possible.
proc sql;
connect to odbc (dsn=idi_clean_archive_srvprd);
create table _temp_address_notification as
select *
from connection to odbc(
select distinct
a.snz_uid
,b.ant_region_code
,b.ant_ta_code
from [IDI_Sandpit].[&si_char_proj_schema.].[&si_char_table_in.] a
inner join [IDI_Clean].[data].[address_notification] b
on a.snz_uid = b.snz_uid and
a.&si_as_at_date between b.ant_notification_date and b.ant_replacement_date);
quit;
Create clustered indexes on tables that are used often.
proc sql;
connect to odbc(dsn=idi_clean_archive_srvprd);
execute(create clustered index cluster_index on
[IDI_Sandpit].[&si_proj_schema].[&db_ds] (&si_index_cols)) by odbc;
disconnect from odbc;
quit;
When joining large tables in SAS use hash objects. If the objects you have in SAS won't fit into memory then try using an index instead (a sketch of an indexed lookup follows the hash example below).
data &si_char_table_out. (drop = return_code:);
set &si_char_table_in.;
/* don't declare the hash object each time a row is read */
if _N_ = 1 then
do;
/* sneaky way to load the columns into the pdv without the data */
if 0 then set work._temp_personal_detail;
declare hash hpd(dataset: 'work._temp_personal_detail');
hpd.defineKey('snz_uid');
hpd.defineData('snz_birth_year_nbr','snz_birth_month_nbr','as_at_age','snz_sex_code','snz_spine_ind');
hpd.defineDone();
end;
return_code_pd = hpd.find();
run;
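If the lookup table is too large to hold in memory as a hash object, an indexed keyed lookup is one alternative. The sketch below is an assumption about how this could look, reusing the work._temp_personal_detail table and column names from the hash example above; it is not taken from an existing project script.
/* create a unique index on the lookup key */
proc datasets lib=work nolist;
   modify _temp_personal_detail;
   index create snz_uid / unique;
quit;

/* keyed lookup: SAS reads matching rows via the index rather than loading the whole table */
data &si_char_table_out.;
   set &si_char_table_in.;
   set work._temp_personal_detail key=snz_uid / unique;
   if _iorc_ ne 0 then
      do;
         /* no match: reset the error flag and clear values retained from the previous match */
         _error_ = 0;
         call missing(snz_birth_year_nbr, snz_birth_month_nbr, as_at_age, snz_sex_code, snz_spine_ind);
      end;
run;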
Turn on the SAS trace to see what the database is up to. Remember to disable this once code is production-ready, otherwise the additional overhead of writing to the log will slow down the actual execution.
/* display the detail of the sql statements and calls */
/* the nostsuffix removes the trailer info that is difficult to read */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;
Use mlogic and mprint to print detailed macro execution information to the SAS log for ease of troubleshooting. Remember to disable these once code is production-ready.
/* display details of SAS statement execution */
options mlogic mprint;
For key programs, try-catch statements are to be used in R.
# read a sql query from file, run it against the database connection and catch any errors
r_data.frame <- function(conn, database_query){
out <- tryCatch(
{
sqlQuery(channel = conn, query = paste(readLines(database_query), collapse="\n"))
},
error = function(err){
message(paste("Error in reading in data from db: ", err))
}
)
return(out)
}
For key programs in SAS, use macro conditionals and write any errors or warnings to the log.
%if "&db_engine" ~= "ODBC" %then
%do;
%put ERROR: In md_write_to_db.sas - non ODBC engine specified. Are you sure you are writing to the database?;
%end;
%else
%do;
...
%end;
When writing SAS macros, it is recommended to keep a debug provision that retains all temporary tables. This makes it easier to troubleshoot the code.
/* clean up */
%if %sysfunc(trim(&si_debug.)) = False %then
%do;
proc datasets lib=work;
delete _temp_:;
run;
%end;
The RUnit package will be used to test R functions.
test.<function_to_test> <- function() {
    checkEquals(<function_to_test>(<function_parameters>), <returned_value>)
    checkException(<function_to_test>("xx"))
}
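As an illustration only, a test suite could be defined and run as below; the suite name, test file location and file naming pattern are assumptions rather than part of this guide.
library(RUnit)

# collect all test.* functions from files matching runit_*.R in rprogs/tests
test_suite <- defineTestSuite("example_tests",
                              dirs = "rprogs/tests",
                              testFileRegexp = "^runit_.*\\.R$",
                              testFuncRegexp = "^test\\.")

# run the suite and print a summary of passes, failures and errors
test_result <- runTestSuite(test_suite)
printTextProtocol(test_result)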
Assert comments and macro calls are acceptable in SAS. Do not use the sasunit macros: this would create a dependency on another package that is not officially part of SAS and that no one within the team has QAed yet.
/* assert: should write a cluster index on two columns snz_uid and the date column */
%si_write_to_db(si_write_table_in=work.timedata, si_write_table_out=sand.test_cluster_index_two_col,
si_cluster_index_flag=True, si_index_cols=%bquote(snz_uid, datetime));
/* checkException: should give you an error about positional parameters must precede keyword parameters */
%si_write_to_db(si_write_table_in=work.timedata, si_write_table_out
si_cluster_index_flag=True, si_index_cols=snz_uid, datetime);
/* assert: table &si_char_table_out. exists and is not empty */
/* assert: because this table is not in work you should also get a note about it being written to work */
/* stress test ~2.5 million rows run time ~ 1.5 minutes */
%si_get_characteristics(si_char_proj_schema=DL-MAA2016-15, si_char_table_in=distinct_mha_pop,
si_as_at_date=date_diagnosed, si_char_table_out=work.mha_pop_char);
All temporary files in SAS should be deleted at the end of a script; temporary files can be identified by the _temp_ notation. It is recommended to wrap these delete statements in a conditional execution based on a debug flag.
/* clean up */
proc datasets lib=work;
delete _temp_:;
run;
Think through the purpose of a graphic (exploratory, part of the modelling process, or presentation of results) and the intended audience (you, other analysts, the public); choose the type of graphic accordingly and only spend a proportionate amount of time on polishing.
# example of using the SIU theme
sofie_wts_freqpoly <-
ggplot(sofie_wts_lng, aes(wt, colour = weight_var)) +
geom_freqpoly(bins = 50) +
scale_colour_siu() +
theme_siu() +
labs(x = "Weight", y = "Count",
title = "SoFIE Weight Distribution - Original vs. Adjusted") +
scale_y_continuous(labels = comma)
Rely minimally on point shape and linetype; with more than about three variants they become hard to tell apart.
Order levels in legends to minimise the eye's workload in travelling from the legend to the data, and consider direct labels as an alternative (a sketch follows below).
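A minimal sketch of reordering factor levels so the legend order matches the data; the data frame and variable names are hypothetical and the forcats package is an assumption, not a requirement of this guide.
library(ggplot2)
library(forcats)

# hypothetical long-format data: one line per group across years
plot_df <- data.frame(
  year  = rep(2014:2016, times = 3),
  value = c(1, 2, 3, 4, 5, 6, 2, 1, 0),
  group = rep(c("low", "medium", "high"), each = 3)
)

# order the legend by each group's final value so the legend entries and
# the line endings read in the same order
plot_df$group <- fct_reorder2(plot_df$group, plot_df$year, plot_df$value)

ggplot(plot_df, aes(year, value, colour = group)) +
  geom_line()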
# give coded categories readable axis labels
... +
    scale_x_discrete(labels = c("Male", "Female"))
# flip the coordinates so the categories run along the y-axis
... +
    coord_flip()
# use a unicode escape so macrons display correctly in plot text
plot(1:10, 1:10, main = "M\u0101ori with a macron")
It is recommended that tabular outputs conform to the pre-defined templates whenever possible. Use of these standardised templates enables Stats NZ to automate output checks and reduces the turnaround time taken for signing outputs out of the IDI.
../templates/model_output_stats.xlsx - the function that builds this output is available in ../github_generic/rprogs
../templates/model_coefficients.xlsx - the function that builds this output is available in ../github_generic/rprogs
../templates/clustering_output.xlsx - the function that builds this output is available in ../github_generic/rprogs
../templates/numeric_summary_stats.xlsx
SIU Tracking Number: TBA
Last updated June 2017 by Ernestynne Walsh and Vinay Benny