Managing huge amounts of test data
Since 2010, total worldwide investment in self-driving vehicle technology and smart mobility has reached approximately $206 billion, aimed at achieving Level 2+ (L2+) autonomous driving. To move beyond Level 2 (L3 to L5), that investment is expected to double. This is serious business. But there is one intractable challenge faced by every player in the market, including DXC: how do you manage the vast amount of data generated during testing? The companies that overcome this challenge will be the ones that lead the race toward Level 5.
You've collected the data. Now what?
A test vehicle can generate over 200TB of raw data during an eight-hour test session. A fleet of 10 vehicles can therefore generate approximately 2PB of data per day (assuming 8 hours of testing per vehicle per day). Now that you have collected this mass of varied and useful data, how do you offload it from the test vehicle, back at the garage, to your data center?
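As a rough back-of-the-envelope check of those volumes (the figures are the ones quoted above; the variable names are only illustrative):

```python
# Rough fleet data-volume estimate using the figures quoted above.
TB_PER_VEHICLE_PER_SHIFT = 200      # ~200 TB of raw data per 8-hour test session
VEHICLES = 10
HOURS_PER_SHIFT = 8

daily_volume_tb = TB_PER_VEHICLE_PER_SHIFT * VEHICLES
print(f"Fleet output per day: {daily_volume_tb} TB (~{daily_volume_tb / 1000:.1f} PB)")
print(f"Average rate per vehicle: {TB_PER_VEHICLE_PER_SHIFT / HOURS_PER_SHIFT:.0f} TB/hour")
```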
Test centers in metropolitan areas can usually scale their network bandwidth to reliably deliver data to DXC's data centers in North America, Europe, and Asia (see map below), especially if the data is collected physically close to one of those data centers or if DXC's logistics services are used. Often, however, data is collected far from any data center, which makes cross-border logistics costly, and some customers choose to store their data in the cloud instead.
DXC currently has two primary methods of transferring data to data centers and clouds, each with its own advantages and disadvantages. Until technological advances make it easier to address these challenges, here’s how we do it:
Method 1
Connect the vehicle to the data center. A test vehicle generates about 28TB of data per hour, and sending the generated data over a fiber-optic connection to a data center or a local buffer for offloading takes 30-60 minutes. This is slow, but it works well when the data is processed in relatively small units.
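If we read those figures as offloading roughly one hour of recording (about 28TB) in 30-60 minutes, the implied sustained link speed works out as follows (a small sketch; this interpretation is an assumption, not a measured specification):

```python
# Implied sustained throughput if ~28 TB must be offloaded in 30-60 minutes.
TB = 28
BITS = TB * 1e12 * 8                      # decimal terabytes -> bits

for minutes in (30, 60):
    gbps = BITS / (minutes * 60) / 1e9    # bits per second -> Gbit/s
    print(f"{TB} TB in {minutes} min  ->  ~{gbps:.0f} Gbit/s sustained")
```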
Method 2
In many cases, the data volumes are too large and fiber-optic connectivity is not available to upload data directly from the test vehicle to the data center (for example, in geographically remote test-drive areas such as deserts, frozen lakes, or rural regions). In such cases, one of two approaches is taken.
a) Bring or ship the recording media to a dedicated ingest station. In this method, the plug-in disks removed from the vehicle are first brought or shipped to an ingest station, from which the data is uploaded to a central data lake. A disk change takes only a few minutes, and the vehicle is immediately ready for the next test. The disadvantage of this method is that you need several sets of disks, so compared to Method 1 you are essentially buying time with money.
b) Put the central data lake in the cloud. This is a variation of method a): data is uploaded from the ingest station to a central data lake in the cloud. The biggest challenge with this method is the bandwidth of the cloud connection. Current standard cloud services offer a maximum of around 100Gbps per connection. In theory, a simple calculation shows that roughly 1PB could be transferred to the cloud in 24 hours (in practice, about half that amount), so moving the volumes described above requires a large number of parallel connections to the cloud. In addition, higher-resolution (4K) vehicle sensors used in R&D generate even more data, and this becomes a major challenge as network costs rise significantly with increased throughput.
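A quick sanity check on those cloud-transfer numbers, using the ~100Gbps per-connection limit and the "about half in practice" observation above (the 50% goodput factor and the 2PB/day fleet figure come from this article; everything else is illustrative):

```python
# Theoretical vs. practical daily transfer volume over one 100 Gbit/s cloud link.
LINK_GBPS = 100
SECONDS_PER_DAY = 24 * 3600
EFFICIENCY = 0.5          # "about half that amount" in practice, per the text above

theoretical_pb = LINK_GBPS * 1e9 * SECONDS_PER_DAY / 8 / 1e15
practical_pb = theoretical_pb * EFFICIENCY
daily_fleet_pb = 2.0      # fleet output from the earlier estimate

print(f"Theoretical: ~{theoretical_pb:.2f} PB/day per link")
print(f"Practical:   ~{practical_pb:.2f} PB/day per link")
print(f"Parallel links needed for {daily_fleet_pb} PB/day: "
      f"{daily_fleet_pb / practical_pb:.1f}")
```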
Future roadmap for data ingestion
Given current research and technological advances, both of these data-capture methods could quickly become obsolete, because the onboard computer itself may be able to analyze the data and select what is actually needed. For example, if a test vehicle can distinguish videos of a particular scene, such as a “right turn at a red light,” there is no need to send several terabytes of data to the main data center; testers only need to send the pre-screened dataset over the internet (for example, via 5G mobile data).
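A minimal sketch of what such on-vehicle pre-screening could look like, assuming a hypothetical scene classifier (classify_scene, the scene labels, and the clip layout are illustrative placeholders, not part of any existing DXC tooling):

```python
from pathlib import Path

# Scene labels worth uploading; illustrative, defined per use case.
SCENES_OF_INTEREST = {"right_turn_at_red_light", "cut_in", "pedestrian_crossing"}

def classify_scene(clip: Path) -> str:
    """Placeholder for an onboard model that labels a recorded clip.
    A real implementation would run a neural network over the sensor data."""
    return "right_turn_at_red_light" if "red_light" in clip.name else "uneventful"

def select_clips_for_upload(recording_dir: Path) -> list[Path]:
    """Keep only clips whose scene label matches a scene of interest."""
    if not recording_dir.is_dir():          # nothing recorded yet
        return []
    return [clip for clip in sorted(recording_dir.glob("*.mp4"))
            if classify_scene(clip) in SCENES_OF_INTEREST]

if __name__ == "__main__":
    clips = select_clips_for_upload(Path("/data/recordings"))   # hypothetical path
    print(f"{len(clips)} clip(s) queued for upload over 5G")
```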
Another possible innovation is a smarter way to reduce data: recording fewer frames per second, or at a lower resolution, when nothing critical is happening. In this case, the scenes considered critical must be defined in advance, which means the data-transfer and data-collection programs become strongly tied to a specific use case. Data collected this way cannot simply be reused over and over for different use cases (training and testing different algorithms and models). Such smart data reduction could be performed in the test vehicle itself or as part of the data upload in a smart acquisition station.
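A minimal sketch of that kind of rule-based reduction, assuming the critical scenes are predefined as described above (the frame rates, resolutions, and scene labels are illustrative assumptions):

```python
from dataclasses import dataclass

# Scenes predefined as critical for the current use case; illustrative only.
CRITICAL_SCENES = {"emergency_braking", "near_collision", "lane_departure"}

@dataclass
class RecordingProfile:
    fps: int
    resolution: tuple[int, int]  # (width, height)

FULL = RecordingProfile(fps=30, resolution=(3840, 2160))   # full-rate 4K capture
REDUCED = RecordingProfile(fps=5, resolution=(1280, 720))  # low-rate capture

def profile_for(scene_label: str) -> RecordingProfile:
    """Record at full rate only while a predefined critical scene is active."""
    return FULL if scene_label in CRITICAL_SCENES else REDUCED

if __name__ == "__main__":
    for scene in ("cruising", "emergency_braking", "parking"):
        p = profile_for(scene)
        print(f"{scene:18s} -> {p.fps} fps @ {p.resolution[0]}x{p.resolution[1]}")
```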
In the longer term, technological advances may include sensor reduction or lossless data compression at the sensor level. Current sensors follow the idea that “the higher the resolution, the better” (and, similarly, “the more sensors and sensor types, the better”). Even if this approach is acceptable for a small number of R&D vehicles, it is difficult to sustain across millions of consumer vehicles.
This brings us to the task of optimizing the sensor set in order to reduce cost and data volume. Machine learning algorithms can certainly help with these challenges, especially if neural-network algorithms are combined with quantum computing to find the optimal position and orientation of the various sensors.
The data-ingestion challenges covered here are just the beginning of data processing for AD/ADAS (Automated Driving/Advanced Driver Assistance Systems). The first steps, such as basic data-quality checks or metadata extraction, are often built into the ingestion process itself. Subsequent steps such as data quality at scale, data catalogs, and data transformations, however, are typically performed in the data lake. That part deserves a closer look in its own right.
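To illustrate the kind of lightweight metadata extraction that can run at ingest time, here is a small sketch (the record fields, file name, and vehicle identifier are assumptions for illustration, not a description of DXC's actual pipeline):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def ingest_metadata(raw_file: Path, vehicle_id: str) -> dict:
    """Collect basic catalog metadata for one raw recording file at ingest time."""
    stat = raw_file.stat()
    sha256 = hashlib.sha256()
    with raw_file.open("rb") as f:
        for chunk in iter(lambda: f.read(8 * 1024 * 1024), b""):
            sha256.update(chunk)
    return {
        "vehicle_id": vehicle_id,
        "file_name": raw_file.name,
        "size_bytes": stat.st_size,
        "modified_utc": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        "sha256": sha256.hexdigest(),   # integrity check used by later quality steps
        "ingested_utc": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    # Hypothetical example: a small dummy recording file and vehicle identifier.
    sample = Path("sample_recording.bin")
    sample.write_bytes(b"\x00" * 1024)
    print(json.dumps(ingest_metadata(sample, vehicle_id="test-car-07"), indent=2))
```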