Executive Summary

This study compiles and assesses a number of freight-related data sources. These data sources are provided in an accessible web-based environment that facilitates freight modeling and analysis at the state level. A subset of the data sources is used for data mining and exploratory analysis to help identify gaps in the data needed for freight modeling projects.

The first step in this project is to compile the publicly available data sources nationwide that contain freight-related data for the California region. Among these data sources, a subset is identified for detailed quality assessment and cross-sectional analysis for common metadata queries in a web-based Geographical Information System (GIS) repository. The quality assessment framework is based on one developed by the International Monetary Fund (IMF). Novel data sources are also evaluated qualitatively in terms of modeling needs and costs.

Among the publicly available data, a core set was identified to serve as a representative spread of freight-related data. This core set includes data from the Freight Analysis Framework (FHWA), U.S. Waterborne Commerce (USACE), Transborder Surface Freight Data (US Census), Rail Carload Waybill Sample (USDOT), and Vehicle Inventory and Use Survey (US Census) to capture commodity flows and truck load rates. The Economic Census (US Census), County Business Patterns (US Census), and Labor Market Library (CA DOF) account for socioeconomic data. The Transportation Energy Data Book (US DOE) and Air Quality System (US EPA) offer data on energy consumption and emissions. Caltrans Truck AADT, MVSTAFF, WIM, and Freeway Performance Measurement System provide data on trucks throughout the state. The Port Wages and Tonnage Report (PMA), Major Airport Operating Statistics (RAND CA), Border Crossings Data (RAND CA), and Rail Performance Measures Weekly Performance Reports (AAR) round out the data on intermodal facilities.

Data mining consists of exploratory analysis to identify hidden relationships between different variables. The initial portion of this task explores data sets individually. Data errors, inconsistencies, and trends are identified and visually represented for each data set. More importantly, cross-sectional data analysis is conducted to search for data relationships between multiple sources. This task is divided into two portions: an unsupervised learning approach and a supervised learning approach.

Data mining algorithms were applied to the core data sets to overcome some of the data errors and inconsistencies identified in the exploratory analysis. A k-means clustering approach was used with a series of stratifying environmental and socioeconomic attributes (emissions, truck AADT, employment, population, and geographic location) to classify the counties in California into 25 groups. These groups can then be used to aid in survey sampling or disaggregation validation.
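The clustering step can be illustrated with a minimal sketch. It is not the study's implementation: the attribute values are synthetic placeholders, the zone names are hypothetical, the attributes are z-scored before clustering (an assumption), and the starting centers are chosen by a simple farthest-first rule so that the result is deterministic. The study used 25 groups; the toy example uses 2.

```python
import math

def standardize(rows):
    """Z-score each column so attributes on different scales are comparable."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def farthest_first(points, k):
    """Deterministic initial centers: start at the first point, then
    repeatedly add the point farthest from the centers chosen so far."""
    centers = [list(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(nxt))
    return centers

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable, so converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels

# Synthetic placeholder rows (NOT real county data):
# emissions, truck AADT, employment, population, latitude, longitude
zones = ["zone_A", "zone_B", "zone_C", "zone_D", "zone_E", "zone_F"]
attributes = [
    [10.0, 1000.0, 50.0, 100.0, 34.0, -118.0],
    [12.0, 1100.0, 55.0, 110.0, 34.1, -118.1],
    [11.0, 1050.0, 52.0, 105.0, 33.9, -117.9],
    [1.0, 100.0, 5.0, 10.0, 40.0, -122.0],
    [2.0, 120.0, 6.0, 12.0, 40.2, -122.1],
    [1.5, 110.0, 5.5, 11.0, 39.9, -121.9],
]
labels = kmeans(standardize(attributes), k=2)
groups = dict(zip(zones, labels))
```

Standardizing first keeps a large-magnitude attribute such as truck AADT from dominating the distance calculation; whether the study scaled its attributes this way is not stated.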

Regression models were developed to estimate the missing values or time periods in some of the data sets at the zonal level. Along with the disaggregation regression models, these models are stored in the freight repository for querying.
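As an illustration of regression-based gap filling, the sketch below fits a one-predictor least-squares line on the zones where a value is observed and uses it to estimate the zones where it is missing. The choice of a single predictor, the function names, and the numbers are assumptions for the example only; the models actually stored in the repository may use different variables and forms.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fill_missing(xs, ys):
    """Replace None entries in ys with predictions from the fitted line."""
    known = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in known], [y for _, y in known])
    return [y if y is not None else intercept + slope * x
            for x, y in zip(xs, ys)]

# Synthetic placeholder values: a zonal predictor (e.g. employment) and a
# partially observed response (e.g. truck volume) with one missing zone.
predictor = [1.0, 2.0, 3.0, 4.0]
response = [3.0, 5.0, None, 9.0]
filled = fill_missing(predictor, response)
```

Observed values are left untouched; only the missing entries are replaced by model estimates.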

An innovative input-output based approach, similar to the matrices used in economic sector analysis, is applied to commodity OD pairs to quantify the interconnectivities between them. The idea is that OD pairs sharing high interconnectivities should have a higher likelihood of being in the same distribution channel, i.e., of being part of an intercity commodity-based tour. By applying information theory, i.e., maximizing entropy, we can make an informed estimate of the most related OD pairs given the information available on existing OD pair totals.
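One standard way to obtain a maximum-entropy matrix that is consistent with known row and column totals is iterative proportional fitting (IPF). The sketch below uses IPF as a stand-in to show the idea; it is not necessarily the formulation used in this study. The totals are synthetic, and the zero diagonal in the seed (an OD pair is not linked to itself) is an assumption made for the example.

```python
def ipf(seed, row_totals, col_totals, tol=1e-9, max_iter=1000):
    """Scale rows and columns of `seed` in turn until both sets of totals
    are matched. Zero cells in the seed stay zero."""
    m = [row[:] for row in seed]
    for _ in range(max_iter):
        for i, target in enumerate(row_totals):
            s = sum(m[i])
            if s > 0:
                m[i] = [v * target / s for v in m[i]]
        for j, target in enumerate(col_totals):
            s = sum(row[j] for row in m)
            if s > 0:
                for row in m:
                    row[j] *= target / s
        # Columns are exact after the column step; check the rows.
        err = max(abs(sum(m[i]) - t) for i, t in enumerate(row_totals))
        if err < tol:
            break
    return m

# Rows and columns both index (hypothetical) OD pairs; cell (i, j) is the
# estimated interconnectivity between pair i and pair j.
seed = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
row_totals = [30.0, 50.0, 20.0]  # synthetic totals, must sum to the same
col_totals = [40.0, 35.0, 25.0]  # grand total as the column totals
links = ipf(seed, row_totals, col_totals)
```

With a uniform seed over the allowed cells, the fitted matrix is the least informative one that still reproduces the totals, so large cells reflect only what the totals imply.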

In addition to the data mining algorithms, we developed several solutions to overcome data integration issues. A GIS-based interface is adopted as a solution to relate zonal and network data on a common platform. This interface allows users to perform queries by geographical attributes, which makes the information system much more powerful than simply combining all the data sources into one traditional table.

A GIS shapefile of the PeMS stations was obtained from colleagues in a related project. The shapefile allows users to map VMT data to a freeway network, which provides a bridge through which zonal and network/corridor level data can be interchanged. A customized search engine was also developed to search the internet quickly, restricted to the sites most related to freight. This search engine supplements the data integration needs of the data repository.

In addition to the regression models, a number of use cases were identified for designing queries in the freight repository. These use cases include queries along three different representations: a zonal level, a zone-to-zone level, and a network level. The zonal level would feature all the attributes evaluated for the k-means clustering. The zone-to-zone level would include FAF2 commodity flow data along with the interconnectivities estimated for the OD pairs. The network/facility level queries would be along the road or rail networks and intermodal facilities.

Lastly, the major data gaps identified included a lack of modal transfer information from the FAF2, a lack of statewide truck screenline survey data for model validation, a lack of updated truckload conversion rates, a lack of information on shipper intercity touring/distribution channels, a lack of fine geographic resolution commodity flow data, and a lack of transport costs, among others.

To address some of these primary gaps, several surveys are identified. An intermodal facility survey can be conducted to update and add more information on where modal transfers can take place. The VIUS can be updated with a small sample survey. A screenline survey can be designed around the major clusters and the corridors within each cluster. Survey instruments, sampling frames, and survey plans were determined for the intermodal facility survey as one example.

The outcomes of this project are twofold. First, the data repository provides a source of freight data for users at every level in California. These data are presented on a geographical interface using a common metadata architecture to encourage consistency between users and stakeholders. We recommend that the repository go through a user-based test phase to ensure that it meets user needs.

Second, the data analysis and mining efforts helped identify and address some common gaps in freight data. Some of these gaps, such as inconsistencies in data, correlations between data sets, distribution channels, and intermodal facility transfers, may be addressed by the current data mining solutions and the initial survey proposed. However, several high-priority gaps were identified that need to be addressed before or in conjunction with any statewide freight modeling. Two such gaps are 1) the need to develop a screenline survey plan to observe truck/rail counts and OD by commodity group, and 2) the need to update the truckload conversion rates in the VIUS. The cluster analysis can make the screenline survey significantly more cost-effective. The VIUS update can be conducted with a small sample that is combined with the 2002 sample to extrapolate the conversion rates to 2010+ rates. These rates are needed to convert any data obtained from screenline surveys in the near future.