
How to perform an attribute-weighted sum of multiple polyline features from a single vector layer?


I have a shapefile of ~8000 polyline vector features representing traffic journeys on road routes. Each has an attribute representing the volume of traffic for that journey. The routes are all generated by the same routing engine, so the node coordinates should match exactly where routes follow common sections of road. I want to effectively add these together in a "map algebra" sort of way to give a representation of network usage. Keeping the result in vector form would be preferable, but ultimately not essential.

I can see how this could be done conceptually by converting each feature to a separate raster whose pixel values equal the traffic volume and then summing all the rasters, but in practice combining 8000 high-resolution rasters would be very painful. Ironically, QGIS is effectively already doing this visually for me: I am displaying the layer with features combined by the "addition" blending method… but I can't see any way of getting that directly out into a much higher-resolution raster format than my display:

Surely there must be a vector way to do this (QGIS or ArcGIS)? If it makes it easier not to go the shapefile route, I have all the route nodes in CSV format, one line per route. The polygon "union" then "dissolve" method outlined in How to union polygons and add attribute values of combined output features looked promising, but Union does not seem to work on line features. Flicking through the ArcGIS Network Analyst "Route Analysis" pages looked potentially promising, if rather over the top for this purpose, and I'm guessing it would require a lot of messing around to get there. I'm sure there must be a simpler way!


Here is an example of how to perform this using Python, Fiona, and Shapely (I tried doing it in arcpy but gave up in frustration). I'm not sure if this is the most efficient approach, but it should do the trick. The first function builds a set of unique segments from all of the routes, then identifies which routes contain each segment (summing up the volume on each of those routes), and finally writes out each segment and its total volume to a new shapefile. The second function, explode_rtree, does the same thing but uses an rtree spatial index over the segments to accumulate volumes as the routes are read, rather than comparing every segment against every route. The script is a bit rough and only tested on a sample data set I constructed. You will need to modify it to work with your particular data, and I'm not sure how quickly it will run with ~8000 input routes.

If you end up using it, please post a comment and let me know how long it took to run.

import fiona
import rtree.index
from shapely.geometry import shape, LineString, mapping


def explode():
    merged_lines = None
    unique_segments = []
    input_routes = []
    # Read input file and build set of unique segments across all routes
    with fiona.open('testarcs.shp', 'r') as source:
        source_crs = source.crs
        source_driver = source.driver
        for route in source:
            # cache the current route and get current route geometry
            route_geom = shape(route['geometry'])
            input_routes.append((route_geom, route))
            # break current route into segments
            route_segments = [LineString([route_geom.coords[seg_indx],
                                          route_geom.coords[seg_indx + 1]])
                              for seg_indx in range(len(route_geom.coords) - 1)]
            if not merged_lines:
                # this is the first iteration so all segments are unique
                unique_segments.extend(route_segments)
                merged_lines = route_geom
            else:
                # test segments to see if they are coincident with the current merged routes
                unique_segments.extend([curr_seg for curr_seg in route_segments
                                        if not merged_lines.contains(curr_seg)])
                # union the current route with all routes that have been processed so far
                merged_lines = merged_lines.union(route_geom)

    output_schema = {'geometry': 'LineString', 'properties': {'Tot_Vol': 'int'}}
    # create output file
    with fiona.open('seg_summ.shp', 'w', driver=source_driver, crs=source_crs,
                    schema=output_schema) as seg_outputs:
        # Now iterate through the unique segments.
        # For each segment find the routes that contain it and
        # sum the volume on those routes
        for curr_segment in unique_segments:
            curr_volume = 0
            for curr_route in input_routes:
                if curr_route[0].contains(curr_segment):
                    curr_volume += curr_route[1]['properties']['Volume']
            new_rec = {'properties': {'Tot_Vol': curr_volume},
                       'geometry': mapping(curr_segment)}
            seg_outputs.write(new_rec)
    return


def explode_rtree():
    segments = {}
    segments_index = rtree.index.Index()
    segment_id = 0
    source_crs = None
    source_driver = None
    # read in routes and break them down into segments
    with fiona.open('testarcs.shp', 'r') as source:
        source_crs = source.crs
        source_driver = source.driver
        for route in source:
            # get current route geometry and volume
            route_geom = shape(route['geometry'])
            route_volume = route['properties']['Volume']
            route_segments = [LineString([route_geom.coords[seg_indx],
                                          route_geom.coords[seg_indx + 1]])
                              for seg_indx in range(len(route_geom.coords) - 1)]
            for curr_segment in route_segments:
                # get set of segments with overlapping bounds
                candidates = list(segments_index.intersection(curr_segment.bounds))
                if not candidates:
                    # current segment does not intersect with any existing segment bounds,
                    # which means it is a new segment
                    segments_index.insert(segment_id, curr_segment.bounds)
                    segments[segment_id] = [curr_segment, route_volume]
                    segment_id += 1
                else:
                    # check each candidate to see if it is the same as the current segment
                    for curr_cand in candidates:
                        if segments[curr_cand][0].equals(curr_segment):
                            # the current segment is coincident with a segment that
                            # has already been processed, so add its volume
                            segments[curr_cand][1] += route_volume
                            break
                    else:  # else clause of the for loop
                        # the current segment does not match any candidate,
                        # so add it as a new segment
                        segments_index.insert(segment_id, curr_segment.bounds)
                        segments[segment_id] = [curr_segment, route_volume]
                        segment_id += 1

    output_schema = {'geometry': 'LineString', 'properties': {'Tot_Vol': 'int'}}
    # create output file
    with fiona.open('seg_summ2.shp', 'w', driver=source_driver, crs=source_crs,
                    schema=output_schema) as seg_outputs:
        for segment_id in segments:
            new_rec = {'properties': {'Tot_Vol': segments[segment_id][1]},
                       'geometry': mapping(segments[segment_id][0])}
            seg_outputs.write(new_rec)
    return
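For timing, something like the following small driver should work (it assumes, as the functions above do, an input shapefile named testarcs.shp with an integer Volume field):

import time

if __name__ == '__main__':
    start = time.time()
    explode_rtree()   # or explode() for the brute-force variant
    print('Finished in %.1f seconds' % (time.time() - start))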

An object-oriented geographic information system shell

Geographic Information Systems (GIS) combine the requirement for graphical display of information with the requirement to manage complex, disk-based data. The object-oriented approach is recognized as an appropriate technology for meeting both of these requirements, and several attempts have been made to build a GIS using object-oriented data management systems. The paper considers the design of a GIS shell by extension of an object-oriented (database) system. A GIS shell does not include any application-specific objects, but extends a basic object-oriented system to provide spatial objects with appropriate behaviour. The starting point for this work was the set of requirements of users of an existing GIS shell. Central objectives are to provide multiple views of application objects, with independence from the stored representation of the spatial attributes. The paper discusses the principles employed in the design of the shell, and a 4-level architecture for organizing shell objects so as to meet the stated objectives. Implementation issues relating to the appropriateness of an object-oriented database management system are discussed towards the end of the paper.


The Package for Analysis and Visualization of Environmental data (PAVE) Frequently Asked Questions

Please note that although the CMAS PAVE Support staff welcome your suggestions for PAVE enhancements and bug fixes, they cannot guarantee implementation of these suggestions. If you require detailed support or would like to fund enhancements to PAVE, a support contract can be put in place. If you are interested in such an arrangement, please contact us via e-mail at [email protected]

The map files in this directory include:

Tile plots have a Map menu item which allows you to choose which one of these maps will be used in your plot. The default is medium resolution state outlines, which come from the file OUTLUSAM.

When PAVE projects the map outlines onto your plot, it first "pre-clips" out many of the polylines in the file before projecting them. In this way, significant trigonometric calculations can be avoided on lines that wouldn't show up on your plot anyway. The default pre-clip range is as follows:

Lines falling outside of the pre-clip range will not be drawn on your PAVE plot. So any domain which partially or completely falls outside of this default pre-clip range will have an incomplete map when rendered within a default PAVE tile plot. The way to get around this is two-fold: first you will need to modify the default pre-clip range used by PAVE, and second you'll need to tell PAVE to use the world map rather than the default medium resolution state outlines.
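(The pre-clip test itself is conceptually just a bounding-box check. The sketch below is purely illustrative and is not PAVE source code; it only shows the kind of test that lets polylines which could never appear on the plot be skipped before any projection work is done.)

# Illustrative sketch only -- not PAVE source code.
# A polyline is kept for projection only if its lat/lon bounding box
# overlaps the pre-clip range (llx, lly, urx, ury in lon/lat degrees).

def preclip(polylines, llx, lly, urx, ury):
    """Return only the polylines whose bounding box overlaps the clip range."""
    kept = []
    for line in polylines:                      # each line is [(lon, lat), ...]
        lons = [p[0] for p in line]
        lats = [p[1] for p in line]
        outside = (max(lons) < llx or min(lons) > urx or
                   max(lats) < lly or min(lats) > ury)
        if not outside:
            kept.append(line)                   # only these get projected
    return kept

# Example: a line over Europe survives a clip range covering Europe/North Africa,
# while a line over Australia is dropped before projection.
europe = [(0.0, 45.0), (10.0, 50.0)]
australia = [(135.0, -25.0), (140.0, -20.0)]
print(preclip([europe, australia], -30.0, 10.0, 60.0, 75.0))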

To modify the default pre-clip range used by PAVE, you can use either 1) environment variables set prior to launching PAVE, 2) command line arguments sent to PAVE at startup, or 3) arguments sent to PAVE's standard input after startup. First, choose a clipping range (bounded by lower-left and upper-right latitude, longitude coordinate pairs) that will enclose the entire geographic domain of your dataset. To use environment variables, enter the following commands at the command line prior to launching PAVE:

Arguments used to modify the pre-clip range used by PAVE are of the form:

In either of the above cases, all subsequent tile plots in that PAVE session will use the newly supplied default clipping range until the range is changed again.

To tell PAVE to use the world map rather than the default medium resolution state outlines, there are two methods. One way is to use the -mapName command line argument with the full pathname/filename to the OUTLHRES file:

causes PAVE to use the supplied map name instead of the default map for all subsequent tile plots. An easier way to obtain the same results for a given plot is to simply use a tile plot's pull down "Map" menu to select the world map, and that plot will then use the world map. However if you are creating multiple plots of your data, you will probably find the first method easier, as it affects all subsequent plots rather than just a single plot.

Here is an example script which was used to create a tile plot with a dataset whose domain stretches from North America over to Europe and down to North Africa:

There is now a utility called arc2mcidas that allows you to translate ArcInfo '.gen' files with map line data into the McIDAS format, and then read it into PAVE. arc2mcidas is distributed with the latest PAVE release; you should be able to find it in

The arc2mcidas input .gen format is pretty simple to get your data into if you don't already have ArcInfo to generate that kind of file for you. Here is an excerpt from a sample .gen file: The format of the .gen files used as input to the arc2mcidas program is simply some arbitrary number of polylines, where each polyline's data is denoted by: The polyline number is totally ignored by arc2mcidas; the lons are negative in the western hemisphere, the lats are negative in the southern hemisphere, and every lat and lon must have a decimal point in it. (Note that the ---- lines aren't part of the file format.)

See the How do I use a world map? FAQ question for more information on using different maps within PAVE.

If you have another data format which you would like to look at using PAVE, you should probably convert it into Models-3 I/O API format. There are several library and include files included with the PAVE installation for your architecture that will help you with the conversion. You will need to write either a C/C++ or a FORTRAN program linking with these libraries in order to translate your data. Or you might decide to modify the original code which generates your data to generate Models-3 I/O API data directly.

The files included with your PAVE installation are:

These can be found under the lib directory for the architecture you are on. For example, on an SGI, they would be located in IRIX5_mips/lib/.

Source code for two example conversion programs is included with the PAVE distribution. One of them converts an ASCII ROM dataset with a Lat/Lon map projection, and the other converts a binary SAQM dataset with a Lambert Conformal map projection. Note that you will need additional libraries, and possibly header files, in order to successfully compile this code. These libraries and header files can be found in:

The example conversion code is in:

NOTE: these example codes will probably need to be adapted for use with your data files. They were written for specific data files being used for testing purposes, and were NOT originally intended to be cleanly written examples of how to easily convert generic gridded data into PAVE format. However, they are being included here with the hope that you may find them useful.

For more detailed information on how to use these libraries and header files with your conversion code, please see the EDSS/Models-3 I/O API Help Pages at http://www.cep.unc.edu/empd/EDSS/ioapi/index.html.

First, you need to configure your account on the remote machine. (This should only need to be done once if you usually run PAVE on the same local machine.) To do this, add the following to your ~/.cshrc on the remote machine:

  • setenv EDSS_HOST your_local_machine_name
  • setenv EDSS_HOST_USER your_local_userid
  • setenv EDSS_HOST_IP IP_address_for_your_local_machine
  • setenv EDSS_DIR dir_where_PAVE_is_installed
  • source $EDSS_DIR/scripts/setup_edss

Here the local_machine is the machine on which you will be running the PAVE user interface. To have the changes take effect, type source ~/.cshrc on the remote machine. Type `which visd` on the remote machine and make sure it returns a path. If it doesn't, then the setup_edss script on the remote machine may not be properly configured.

To get started viewing remote data,

  • start pave on the local machine
  • let myport be the contents of /tmp/sbus_port_$USER on the local machine
  • log on to the remote machine, and type start_edss_daemons -port myport all

If things have worked properly, you should see some messages appear on the remote machine. You should then be able to browse remote data on that machine. To choose the remote machine for browsing files, click on the button in the center of the browser (below the file and directory lists) and type in a host.

For example, assume the local machine = tom.ncsc.org, the local user ID is george, and the remote machine=flyer.ncsc.org.

Note that you will need to repeat only the last step each time you want to browse data with a new PAVE session.

All subsequent plots will use the supplied value for height/width, until a non-positive value is supplied as a subsequent argument. At that point PAVE's default height/width will be used.

If you are doing a vector plot from a script, you must enter the following commands before the -vector command:

Note that you don't have to "type" the lines into standard input - you can copy and paste them with your X-windows system (with the middle mouse button if you're using xterms).

After selecting the layers you want to see, type a statement of the following form into PAVE's standard input: Here the arguments to -subdomain are x1 y1 x2 y2, which define the region to be selected. This will cause the data where x = 80 and y is between 20 and 90 to be selected. To see this region on the screen, you can choose "Select Regions of Interest Matching Current Dataset". A window showing the domain will come up. Alternatively, you can select the cells with the mouse, but this gets difficult when grid cells are small and you want to select a very precise area.

Now that the layers and region to be plotted are selected, you need to set the cross section type. Under the Graphics menu, choose "Set Tile Plot Cross Section Type" and you will see a submenu of X, Y, or Z cross sections. To make a plot for the above example choose "X Cross Section" (remember this by noting that x is constant). Next, draw the plot with a formula using your dataset by choosing "Create Tile Plot" from the Graphics menu, and you will see the plot of that cross section. Currently, you cannot draw vectors on vertical cross section plots.

  1. Load and select a data set that contains vector data
  2. Click on the variables you want to use for the vector plot in the species list (e.g. UWINDa, VWINDa)
  3. From the Datasets menu, choose "Select Layer Ranges Matching Current Dataset"
  4. Set the sliders to the highest and lowest layers for which you want to see a cross section.
  5. Select the single row or column to plot either by typing
    -subdomain xmin ymin xmax ymax (note that either xmin=xmax or ymin=ymax for a vertical cross-section) into PAVE's standard input
  6. Choose "Select Regions of Interest Matching Current Dataset" from the Dataset menu to visualize the cross section you'll be viewing
  7. Under the Graphics menu, pull down "Set Tile Plot Cross Section Type", and select X Cross Section if xmin=xmax, or Y Cross Section if ymin=ymax
  8. For a vector-only plot, type a line of the form -vector UWINDa VWINDa into PAVE's standard input. For an example of a vector plot with a tile background, type -vectorTile tile_formula UWINDa VWINDa

If this happens, in your .Xdefaults file, add the following lines:

This should do the trick. Note: you will either need to log out and log back in again, or do an "xrdb ~/.Xdefaults", in order for these changes to take effect.

The operators you mention produce a single number calculated over the currently selected levels, rows, columns, and time steps for the given formula. The version of PAVE you have (PAVE 1.4beta or earlier) expects to always calculate an array of data to make a plot - so if your formula is something like mean(O3a), you get that message. If the formula is mean(O3a)/NO2a, you will get a plot.

Subsequent versions of PAVE changed how this is handled. If you make a "plot" of mean(O3a), PAVE will now calculate the result and display it in the PAVE message window along with information about the currently selected domain from which that number was calculated.

Here is what the operators you mention above specifically do, along with a number of other operators which also return a single number:

The min and max operators behave a little differently:

Prior to PAVE version 2.1 alpha, there was a way (albeit cumbersome!) to get PAVE to average each grid cell in time. PAVE has a little-known feature which allows you to specify an hour index after a variable name. For example, O3a:1 is the first hour of ozone. So, if you wanted to plot each cell averaged in time over the first twelve hours of your data, you could enter and plot the following formula:

(O3a:1+O3a:2+O3a:3+O3a:4+O3a:5+O3a:6+O3a:7+O3a:8+O3a:9+O3a:10+O3a:11+O3a:12)/12

This is cumbersome, and it also uses a lot of memory.

We currently recommend that you download the latest version of PAVE, and use the -nhouraverage option, or from the Graphics menu, select the Create NHour Average Tile Plot Menu item.

The second line sets the font for the file browser used by PAVE. After adding these to ~/.Xdefaults, you must either log out and log back in again or enter "xrdb ~/.Xdefaults", then restart PAVE for these changes to take effect.

  • Select the layer of the variable you want using the Formulas menu's "Select Layer Ranges Matching Current Formula" item (or use the -levelRange command line argument).

You can resize a PAVE plot window interactively, or you can control the exact size of an image (height and width) using PAVE command line arguments -height and -width. If you'd like to make an image really really huge (bigger than the screen, for example), you'd need to use this method. See the PAVE documentation at paveScripting.html#Scripting for further information on the -height and -width command line arguments.

Thus, if you are exporting animations or otherwise saving images in PAVE, the window of interest must be in the foreground and should not be obscured by any other windows during the time that the images are captured. When saving an animation, the images are first captured, and then go through several conversion steps. It is OK to put the PAVE window in the background during the conversion steps (i.e. once it has finished animating the image).

These configurations can be saved to a file, and used in future pave sessions. For example, if /home/user/config contains the single line: and you start pave using then all of your plots will automatically use this notation. In fact, you might want to just alias "pave" to be "pave -configFile " if you always want this. Please see the Configuring Plots section of the PAVE User Guide at paveConfigure.html#Configuring for further information on how to do this.

Prior to PAVE version 2.1 Alpha, when you displayed PAVE back to a PC using software such as Hummingbird's Exceed, a problem could result when trying to create GIFs or MPEGs due to the number of colors available on the PC display. You may see a message in the PAVE window like: For versions prior to PAVE 2.1 Alpha, if you wish to capture images while displaying to a PC, bring up your PC's Display properties and set the Color Palette to use 256 colors. This problem does not occur when displaying back to UNIX machines.

Some of the features we would like to add are listed below.

    Plot observational data & compare with model data

(This is being funded for development for PAVE release 2.1 Alpha)

There are many possibilities here - we can plot explicit points, or gridded observations. There are several potential types of rendering, including colored circles or textual numbers at observation sites, and contour lines showing the observations (with the first two being easier to incorporate). It is likely that once we have the ability to plot explicit points, we can use the software to view point source emissions data of various types.

All (This is being funded for development for PAVE release 2.1 Alpha)

PAVE capabilities (as are practical) should be available via scripts and standard input.

People often want to use simple statistics to assist with their data analysis (e.g. mean, std. dev.). It would be good if PAVE could produce these. In addition, people may want to slice the data in different ways to obtain these statistics (e.g. mean for each time step or for each vertical layer). There are also a number of commonly used statistics that deal with comparing model data with observations. We may want to examine the "tables" generated for OTAG for ideas for other typically used statistics.

PAVE performance, both CPU and memory related, can be improved in a number of ways. For example, loops can be reordered to reduce paging to disk, memory leaks can be plugged, and memory can be reused in some formulas.

The PAVE user interface could use a number of enhancements. Probably the most important of these is to make all plots accessible via the user interface. For example, vector plots, scatter plots, and time series plots with multiple lines are not currently available via the user interface. In addition, small usability improvements could include: an hourglass cursor when PAVE is busy, an indicator of which "mode" a PAVE tile plot is in (e.g. probing, zooming, time series), making changes to the min and max take effect without hitting return, giving windows more meaningful names, and many more.

Additional kinds of plots have been requested by users. Some examples include: vertical cross section plots where the size of the layers is proportional to actual layer size, vertical profiles, contour plots, flux plots, observational data plots described above, PIG plots, nested grid plots, adaptive grid plots, box & whisker plots. Also, making scatter plots internal to PAVE would help reduce the problems associated with transferring data to BLT and would provide additional flexibility.

Currently, long formulas are often difficult to use. This might be improved by providing some defaults for commonly used long formulas (e.g. VOC, TOC, wind speed). Allowing user-defined aliases for long formulas and hierarchical formulas (e.g. NOy - NOx) would be very useful. Allowing the user to specify that multiple parts of a formula refer to the same data set would reduce the complexity involved with long formulas (e.g. [NO + NO2]a where the a is applied to both NO and NO2).

It would be useful to be able to plot multiple maps (e.g. state + counties, or counties + roads) using different colors and thicknesses as needed. Another possibility would be to allow backgrounds generated by packages such as GISs to be displayed - these might show roads, water, cities, etc.

It would be useful in some cases to allow users to define the cutoffs between the colors plotted (currently PAVE makes each bin of equal size from the min to the max). Another alternative might be to assign bins according to percentiles. Additional configuration options could be added to vector plots with respect to the density of the vectors plotted.

We might use the colors allocated by Netscape so the two packages do not fight for colors, prevent allocation of colors that are not actually being used, and provide the user with the option of a private color table so that PAVE is not at the mercy of other applications when allocating colors. Also, we could improve the options for black and white plots by adding patterns to the existing shades of gray.

The density and location of grid cell numbering on tile plots and in the domain selector could be improved. Also, the legend could be outlined in black to set off any white or light values and be positioned relative to the size of the legend labels. The size of tile plots could be made user configurable, and the aspect ratio preserved when the plot is manually resized. Data plotted on time series plots could be labeled more clearly.


Ways to clear an existing array A:

Method 1 (this was my original answer to the question)

This code (A = []) will set the variable A to a new empty array. This is perfect if you don't have references to the original array A anywhere else, because it actually creates a brand new (empty) array. Be careful with this method: if you have referenced the array from another variable or property, the original array will remain unchanged. Only use this if you reference the array only by its original variable A.

This is also the fastest solution.

This code sample shows the issue you can encounter when using this method:

Method 2 (as suggested by Matthew Crumley)

This will clear the existing array by setting its length to 0. Some have argued that this may not work in all implementations of JavaScript, but it turns out that this is not the case. It also works when using "strict mode" in ECMAScript 5 because the length property of an array is a read/write property.

Method 3 (as suggested by Anthony)

Using .splice() will work perfectly, but since the .splice() function will return an array with all the removed items, it will actually return a copy of the original array. Benchmarks suggest that this has no effect on performance whatsoever.

Method 4 (as suggested by tanguy_k)

This solution (repeatedly popping elements until the array is empty) is not very succinct, and it is also the slowest solution, contrary to earlier benchmarks referenced in the original answer.

Performance

Of all the methods of clearing an existing array, methods 2 and 3 are very similar in performance and are a lot faster than method 4. See this benchmark.

As pointed out by Diadistis in their answer below, the original benchmarks that were used to determine the performance of the four methods described above were flawed. The original benchmark reused the cleared array so the second iteration was clearing an array that was already empty.

The following benchmark fixes this flaw: http://jsben.ch/#/hyj65. It clearly shows that methods #2 (length property) and #3 (splice) are the fastest (not counting method #1 which doesn't change the original array).

This has been a hot topic and the cause of a lot of controversy. There are actually many correct answers and because this answer has been marked as the accepted answer for a very long time, I will include all of the methods here. If you vote for this answer, please upvote the other answers that I have referenced as well.


Appraisal of infrastructural amenities to analyze spatial backwardness of Murshidabad district using WSM and GIS-based kernel estimation

Backwardness is the result of different factors that exist in a society. The present study focused on infrastructural facilities and basic amenities to estimate the spatial distribution of backward areas in the Murshidabad district of West Bengal, India. An adequate supply of infrastructure has long been regarded as essential for economic development by both academicians and policymakers, and improved infrastructure has an aggregate impact on income and economic development. From this point of view, the present study aimed to appraise infrastructural facilities and basic amenities to analyze the spatial extent of backwardness. To that end, 17 decision criteria under four parameters, namely physical infrastructure, medical and health services, educational amenities and recreational facilities, were selected, and a geospatial technique, kernel density estimation, was used for spatial mapping of those decision criteria to show the spatial density of available services. Concurrently, the weighted sum model, a multi-criteria decision approach, was applied for map overlay to display backward areas using a reverse scale factor from 1 to 5, where 5 indicates a high spatial density of infrastructural facilities (low backwardness) and 1 indicates a low density of infrastructural facilities (high backwardness). Using this rating scale, the spatial extent of backwardness was estimated. Unlike traditional methods of estimating backwardness, the present study applied a geospatial technique that is quite new in this type of study. The present study also measured the accuracy of the result using prediction accuracy. The result revealed an overall prediction accuracy of 82% (Pa = 0.82), which validates the weighted sum model and kernel density estimation applied in the spatial analysis of backwardness. The present study demonstrates the efficiency of geospatial techniques, which may also be helpful in other fields of research.



Feature selection and hyper parameters optimization for short-term wind power forecast

Accurate wind power forecasting plays an increasingly significant role in normal power grid operation with large-scale wind energy. Precise and stable forecasting of wind power with a short computational time is still a challenge owing to various uncertainty factors. This study proposes a hybrid model based on a data preprocessing strategy, a modified Bayesian optimization algorithm, and the gradient boosted regression trees approach. More specifically, the powerful information mining ability of the maximum information coefficient is used to select the important input features, and the modified Bayesian optimization algorithm is introduced to optimize the hyperparameters of the gradient boosted regression trees to acquire more satisfactory forecasting precision and computation cost. Datasets from a Chinese wind farm are used in case studies to analyze the prediction accuracy, stability, and computational efficiency of the proposed model. The point forecasting and multi-step forecasting results reveal that the performance of the hybrid forecasting model exceeds that of all the compared models. The developed model is extremely useful for enhancing prediction precision and is a reasonable and valid tool for online prediction with increasing data.



4.1 Training

The final DL-FRONT network was trained with 14 353 input and labeled data grid pairs (the number of time steps over the period with available CSB data) covering the years 2003–2007 using 3-fold cross-validation. Each of the three folds used 9568 (two-thirds of total) data grid pairs randomly chosen from the full set and randomly ordered in time. Training stopped when the loss did not improve for 100 epochs (passes through the training dataset), leading to training that lasted 1141, 1142, and 1136 epochs, respectively. Figure 3 shows loss and accuracy results of the training. The training and accuracy curves indicate that the network training appears to be converging on solutions that are not overfit to data and have an overall categorical accuracy of near 90 % (percentage of CSB fronts identified by DL-FRONT). Fold 3 produced the lowest loss and highest accuracy so those weights were selected as the final result. We used the final result network to generate 37 984 3-hourly front likelihood data grids covering the entire 2003–2015 time span.

Figure 3. The training loss (a) and training accuracy (b) for each training epoch of the DL-FRONT NN over three cross-validation folds.
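As a rough sketch of this training setup only: the toy network, random grids, and Keras-style calls below are placeholders, not the actual DL-FRONT architecture, loss, or CSB training data. It illustrates 3-fold cross-validation with training stopped once the monitored loss fails to improve for 100 epochs.

# Illustrative sketch of 3-fold training with a 100-epoch early-stopping patience.
# The tiny model and random data are stand-ins, NOT the DL-FRONT network or the
# CSB-labeled grids described in the text.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold

x = np.random.rand(300, 16, 16, 5).astype("float32")   # fake input grids
y = np.random.randint(0, 5, size=(300, 16, 16))         # fake per-cell front-type labels

def build_model():
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu",
                               input_shape=(16, 16, 5)),
        tf.keras.layers.Conv2D(5, 1, activation="softmax"),  # per-cell class likelihoods
    ])

results = []
for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=3, shuffle=True).split(x)):
    model = build_model()
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # Stop when the validation loss has not improved for 100 epochs,
    # keeping the best weights seen so far.
    stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=100,
                                            restore_best_weights=True)
    hist = model.fit(x[train_idx], y[train_idx],
                     validation_data=(x[val_idx], y[val_idx]),
                     epochs=2000,               # upper bound; early stopping triggers first
                     callbacks=[stop], verbose=0)
    results.append(min(hist.history["val_loss"]))
    print("fold", fold, "best val loss", results[-1])

# Analogous to keeping the weights from the best-performing fold.
print("selected fold:", int(np.argmin(results)))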

A sample output of the DL-FRONT algorithm and the corresponding CSB front locations for 1 August 2009 at noon UTC (a period not used in training) is shown in Fig. 4. The DL-FRONT results are very similar to the CSB fronts in terms of the general locations. There are spatial discrepancies that are sometimes large enough that the front locations do not overlap, and there are several discrepancies regarding the type of front. The DL-FRONT results are missing a Pacific coast cold front and a western mountains stationary front from the CSB observations. DL-FRONT identifies additional fronts in the Pacific Ocean and on Baffin Island in the Arctic; these are beyond the areas regularly analyzed for fronts by the National Weather Service shown in Fig. 2.

Figure 4. Side-by-side comparison of CSB (a) and DL-FRONT (b) front boundaries for 1 August 2009 at 12:00:00 UTC. The CSB fronts were drawn three grid cells wide. The intensities of the colors for the different front types in the DL-FRONT image represent the likelihood value (from 0.0 to 1.0) associated with each grid cell.


4.7. Geocoding

Geocoding is the process used to convert location codes, such as street addresses or postal codes, into geographic (or other) coordinates. The terms “address geocoding” and “address mapping” refer to the same process. Geocoding address-referenced population data is one of the Census Bureau’s key responsibilities. However, as you know, it’s also a very popular capability of online mapping and routing services. In addition, geocoding is an essential element of a suite of techniques that are becoming known as “business intelligence.” We’ll look at applications like these later in this chapter, but first let’s consider how the Census Bureau performs address geocoding.

ADDRESS GEOCODING AT THE U.S. CENSUS

Prior to the MAF/TIGER modernization project that led up to the decennial census of 2010, the TIGER database did not include a complete set of point locations for U.S. households. Lacking point locations, TIGER was designed to support address geocoding by approximation. As illustrated below, the pre-modernization TIGER database included address range attributes for the edges that represent streets. Address range attributes were also included in the TIGER/Line files extracted from TIGER. Coupled with the Start and End nodes bounding each edge, address ranges enable users to estimate locations of household addresses.

How address range attributes were encoded in TIGER/Line files (U.S. Census Bureau 1997). Address ranges in contemporary TIGER/Line Shapefiles are similar, except that “From” (FR) and “To” nodes are now called “Start” and “End”. Also, changes have been made to field (column) names in the attribute tables. Compare the names of the address range fields that you looked at in the second Try This exercise to those above.

Here’s how it works. The diagram above highlights an edge that represents a one-block segment of Oak Avenue. The edge is bounded by two nodes, labeled “Start” and “End.” A corresponding record in an attribute table includes the unique ID number (0007654320) that identifies the edge, along with starting and ending addresses for the left (FRADDL, TOADDL) and right (FRADDR, TOADDR) sides of Oak Avenue. Note also that the address ranges include potential addresses, not just existing ones. This is to make sure that the ranges will remain valid as new buildings are constructed along the street.

A common geocoding error occurs when Start and End designations are assigned to the wrong connecting nodes. You may have read in Galdi’s (2005) white paper “Spatial Data Storage and Topology in the Redesigned MAF/TIGER System,” that in MAF/TIGER, “an arbitrary direction is assigned to each edge, allowing designation of one of the nodes as the Start Node, and the other as the End Node” (p. 3). If an edge’s “direction” happens not to correspond with its associated address ranges, a household location may be placed on the wrong side of a street.

Although many local governments in the U.S. have developed their own GIS "land bases" with greater geometric accuracy than pre-modernization TIGER/Line files, similar address geocoding errors still occur. Kathryn Robertson, a GIS Technician with the City of Independence, Missouri (and a student in the Fall 2000 offering of this course), pointed out how important it is that Start (or "From") nodes and End (or "To") nodes correspond with the low and high addresses in address ranges. "I learned this the hard way," she wrote, "geocoding all 5,768 segments for the city of Independence and getting some segments backward. When address matching was done, the locations were not correct. Therefore, I had to go back and look at the direction of my segments. I had a rule of thumb: all east-west streets were to start from west and go east; all north-south streets were to start from the south and go north" (personal communication).

Although this may have been a sensible strategy for the City of Independence, can you imagine a situation in which Kathryn’s rule-of-thumb might not work for another municipality? If so, and if you’re a registered student, please add a comment to this page.

AFTER MAF/TIGER MODERNIZATION

If TIGER had included accurate coordinate locations for every household, and correspondingly accurate streets and administrative boundaries, geocoding census data would be simple and less error-prone. Many local governments digitize locations of individual housing units when they build GIS land bases for property tax assessment, E-911 dispatch and other purposes. The MAF/TIGER modernization project begun in 2002 aimed to accomplish this for the entire nationwide TIGER database in time for the 2010 census. The illustration below shows the intended result of the modernization project, including properly aligned streets, shorelines, and individual household locations, shown here in relation to an orthorectified aerial image.

Intended accuracy and completeness of modernized TIGER data in relation to the real world. TIGER streets (yellow), shorelines (blue), and housing unit locations (red) are superimposed over an orthorectified aerial image. (U.S. Census Bureau n.d.). National coverage of housing unit locations and geometrically-accurate streets and other features were not available in 2000 or before.

The modernized MAF/TIGER database described by Galdi (2005) is now in use, including precise geographic locations of over 100 million household units. However, because household locations are considered confidential, users of TIGER/Line Shapefiles extracted from the MAF/TIGER database still must rely upon address geocoding using address ranges.

LEVERAGING TIGER/LINE DATA FOR PRIVATE ENTERPRISE

Launched in 1996, MapQuest was one of the earliest online mapping, geocoding and routing services. MapQuest combined the capabilities of two companies: a cartographic design firm with long experience in producing road atlases, “TripTiks” for the American Automobile Association, and other map products, and a start-up company that specialized in custom geocoding applications for business. Initially, MapQuest relied in part on TIGER/Line street data extracted from the pre-modernization TIGER database. MapQuest and other commercial firms were able to build their businesses on TIGER data because of the U.S. government’s wise decision not to restrict its reuse. It’s been said that this decision triggered the rapid growth of the U.S. geospatial industry.

Later on in this chapter we'll visit MapQuest and some of its more recent competitors. Next, however, you'll have a chance to see how geocoding is performed using TIGER/Line data in a GIS.


Visualisation strategies for environmental modelling data

We present a framework that allows users to apply a number of strategies to view and modify a wide range of environmental data sets for the modelling of natural phenomena. These data sets can be concurrently visualised to find inconsistencies or artefacts. This ensures at an early stage that models set up for the simulation of hydrological or thermal processes will not give implausible results due to complications based on input data. A number of generally applicable visualisation techniques are provided by our framework to help researchers detect potential problems. We also propose a number of mapping algorithms for the integration of multiple data sets to resolve some of the most common issues. Techniques for the presentation of input- and modelling data in combination with simulation results are proposed with respect to the benefits of visualisation of environmental data within specialised environments. The complete workflow from input data to presentation is demonstrated based on a case study in Central Germany. We identify typical problems, propose approaches for a suitable data integration for this case study and compare results of the original and modified data sets.



Machine-learning is the automated process of uncovering patterns in large datasets using computer-based statistical models, where a fitted model may then be used for prediction purposes on new data. Despite the growing number of machine-learning algorithms that have been developed, relatively few studies have provided a comparison of an array of different learners — typically, model comparison studies have been restricted to a comparison of only a few models. This study evaluates and compares a suite of 10 machine-learners as classification algorithms for the prediction of soil taxonomic units in the Lower Fraser Valley, British Columbia, Canada.

A variety of machine-learners (CART, CART with bagging, Random Forest, k-nearest neighbor, nearest shrunken centroid, artificial neural network, multinomial logistic regression, logistic model trees, and support vector machine) were tested in the extraction of the complex relationships between soil taxonomic units (great groups and orders) from a conventional soil survey and a suite of 20 environmental covariates representing the topography, climate, and vegetation of the study area. Methods used to extract training data from a soil survey included by-polygon, equal-class, area-weighted, and area-weighted with random over sampling (ROS) approaches. The fitted models, which consist of the soil-environmental relationships, were then used to predict soil great groups and orders for the entire study area at a 100 m spatial resolution. The resulting maps were validated using 262 points from legacy soil data.

On average, the area-weighted sampling approach for developing training data from a soil survey was most effective. Using a validation of R = 1 cell, the k-nearest neighbor and support vector machine with radial basis function resulted in the highest accuracy of 72% for great groups using ROS; however, models such as CART with bagging, logistic model trees, and Random Forest were preferred due to the speed of parameterization and the interpretability of the results, while resulting in similar accuracies ranging from 65–70% using the area-weighted sampling approach. Model choice and sample design greatly influenced outputs. This study provides a comprehensive comparison of machine-learning techniques for classification purposes in soil science and may assist in model selection for digital soil mapping and geomorphic modeling studies in the future.
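As a loose illustration of this kind of model comparison (synthetic data stands in for the study's 20 covariates, soil-survey training samples, and validation points, and the numbers it prints have no relation to the accuracies reported above), a few of the learners named here could be compared with cross-validated accuracy in scikit-learn:

# Illustrative comparison of a few of the classifiers named above on synthetic
# data; this is not the study's workflow, covariates, or sampling design.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Stand-in for ~20 environmental covariates and a handful of soil classes.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)

models = {
    "k-nearest neighbor": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM (RBF)": SVC(kernel="rbf", gamma="scale"),
    "multinomial logistic regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print("%-32s mean accuracy: %.2f" % (name, scores.mean()))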

