ProcessSegments

Loads a file containing arrays of segmentation data into a segment table and creates a file of unique segment IDs.

| Key | Value(s) | Description |
| --- | --- | --- |
| method | "ProcessSegments" | Loads a file containing arrays of segmentation data into a segment table and creates a file of unique segment IDs. |
| sourcepath | "Path" | Path to the directory containing the raw data to be processed. If the data are stored in more than one folder, this should be the root folder immediately above the individual data folders. |
| targetpath | "Path" | Root folder for storage of processed data. Generally the same as sourcepath. Location where the HASH-KEY file is stored; the filename is maindatafile.dat. |
| segspath | "Path" | Root folder for processed segment files. |
| dirs[] | ["subfolder1", "subfolder2", "..."] | List of folders containing raw data. |
| childfolder | "FolderName" | Optional. Name of the child folders, if the daily folders contain them. |
| verbose | true/false | Default = false. If true, provides additional logging in the "info" section of the API response. |
| finalConvert | true/false | Default = true. If true, creates maindatafile.txt from maindatafile.dat, ready for loading into a datajet table. |
| cleanOnStart | true/false | Default = false. If true, removes maindatafile.dat before starting processing. |
| sample | true/false | Default = false. If true, loads only the first file in each specified folder. |
| writekeys | true/false | Default = true. If true, writes out the segment files. Set to false to test generation of the hash-key file only. |
| numericFoldersOnly | true/false | If true, ignores non-numeric folders when identifying sub-folders to process. |
| ignoreCompressedFolders | true/false | If true, ignores folders containing only compressed data (*.gz, *.zip, *.rar). |
| project | | |
| lastFolder | | Maximum number of folders to process, starting with the most recent folder name and going backwards (assuming that folders have date names, e.g., 20240213). |
| maxLines | | Maximum number of input lines to process before the process stops, up to 7,000,000,000 (7 billion). Note: maxLines is limited by the available RAM on the Datajet Server; 256 GB of RAM is required to process up to 7 billion lines. |
| collated | true/false | Deprecated in v 6.11.11.01. Default = true. Processing optimizes audience calculation performance. |
| segmentMaps | true/false | From v 8.6.2.01. Default = false. If true, accelerated bitmaps are created in the segment folders. This adds approx. 4% to the loaded data, but gives fast query times when working with segment datasets. |
| intermediateMode | true/false | From v 8.6.2.01. Default = false. If true, requires data to have been pre-processed (see PreProcessSegments); ProcessSegments will find nothing to process if intermediateMode = true and no pre-processed files are present. If false, accelerator files will be ignored. |
| enhanceData | true/false | From v 8.6.2.01. For information only: enhanceData always takes the same value as intermediateMode (if intermediateMode = true, enhanceData = true). |

enhanceData generates additional basic metrics for each hash key (record) in the primary contact table:
  • entries: total number of times the hash code appeared in the input files
  • total_segments: total number of segments in which the hash code appears
  • unique_segments: number of unique segments in which the hash code appears

NOTE: To see these metrics, the CreateTableFromFile method that loads the primary contact table must be modified to include the additional values.
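
The three enhanceData metrics can be pictured with a small Python sketch. It is an illustration of the definitions above, not the engine's implementation: the input layout (one hash code plus a list of segment IDs per line) and the function name hash_key_metrics are assumptions, as is the reading of total_segments as a count that includes repeats.

```python
from collections import defaultdict


def hash_key_metrics(records):
    """Compute per-hash metrics from (hash_code, segment_ids) pairs.

    Each pair stands for one input line (an assumed layout, for illustration).
    """
    entries = defaultdict(int)         # times the hash code appeared
    total_segments = defaultdict(int)  # segment memberships, repeats included
    seen_segments = defaultdict(set)   # distinct segment IDs per hash code

    for hash_code, segment_ids in records:
        entries[hash_code] += 1
        total_segments[hash_code] += len(segment_ids)
        seen_segments[hash_code].update(segment_ids)

    return {
        hash_code: {
            "entries": entries[hash_code],
            "total_segments": total_segments[hash_code],
            "unique_segments": len(seen_segments[hash_code]),
        }
        for hash_code in entries
    }


# Toy input: hash "abc" appears on two lines and shares segment 102 between them.
sample_records = [
    ("abc", [101, 102]),
    ("def", [101]),
    ("abc", [102, 103]),
]
metrics = hash_key_metrics(sample_records)
print(metrics["abc"])
```

Under these assumptions, "abc" has two entries, four segment memberships in total and three unique segments.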


{
  "method": "ProcessSegments",
  "sourcepath": "/home/engine/datasources/OneTouch/Eyeota/mft/US/",
  "targetpath": "/home/engine/datasources/OneTouch/Eyeota/mft/US/",
  "segspath": "/home/engine/datasources/OneTouch/Eyeota/mft/US/segs/",
  "dirs": [
    "20240318",
    "20240319",
    "20240320",
    "20240321",
    "20240322",
    "20240323",
    "20240324"
  ],
  "finalConvert": true,
  "cleanOnStart": true,
  "sample": true,
  "writekeys": true,
  "numericFoldersOnly": true,
  "ignoreCompressedFolders": true,
  "description": "Process Eyeota Segments",
  "project": "eyeota",
  "tooltip": "Takes raw data in unzipped format and turns into segment files and hash key file"
}
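
How numericFoldersOnly and lastFolder narrow down the folders to process can be sketched in Python. This is a hedged illustration of the behavior described in the parameter table, not the server's code; select_folders and its arguments are hypothetical names, and date-named folders (e.g., 20240318) are assumed so that a reverse string sort puts the most recent folder first.

```python
def select_folders(folder_names, last_folder=None, numeric_folders_only=False):
    """Return the folders that would be processed, most recent first.

    last_folder: maximum number of folders to keep (None keeps all of them).
    numeric_folders_only: drop names that are not made up of digits only.
    """
    candidates = list(folder_names)
    if numeric_folders_only:
        candidates = [name for name in candidates if name.isdigit()]
    # YYYYMMDD names sort chronologically as plain strings.
    candidates.sort(reverse=True)
    if last_folder is not None:
        candidates = candidates[:last_folder]
    return candidates


available = ["20240318", "segs", "20240320", "20240319"]
selected = select_folders(available, last_folder=2, numeric_folders_only=True)
print(selected)
```

Here the non-numeric folder "segs" is dropped first, and the two most recent date folders are kept.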

The following shows how to use ProcessSegments with data stored in a sub-folder of the primary folders:

{
  "method": "ProcessSegments",
  "sourcepath": "/home/engine/datasources/OneTouch/Eyeota/mft/US/",
  "targetpath": "/home/engine/datasources/OneTouch/Eyeota/mft/US/",
  "segspath": "/home/engine/datasources/OneTouch/Eyeota/mft/US/segs/",
  "dirs": [
    "20240318",
    "20240319",
    "20240320",
    "20240321",
    "20240322",
    "20240323"
  ],
  "childfolder": "HEMSHA2",
  "finalConvert": true,
  "cleanOnStart": true,
  "sample": false,
  "verbose": false,
  "writekeys": true,
  "description": "Process Eyeota Segments",
  "project": "Q1Patch1Eyeota_Pro"
}
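
When the daily folders follow the YYYYMMDD naming used above, the dirs list can be generated rather than typed by hand. The following Python sketch builds the same request body as the example above; dated_dirs is a hypothetical helper, and how the JSON is submitted to the server is deliberately left out.

```python
import json
from datetime import date, timedelta


def dated_dirs(first, last):
    """List YYYYMMDD folder names for every day from first to last, inclusive."""
    day_count = (last - first).days + 1
    return [(first + timedelta(days=i)).strftime("%Y%m%d") for i in range(day_count)]


request = {
    "method": "ProcessSegments",
    "sourcepath": "/home/engine/datasources/OneTouch/Eyeota/mft/US/",
    "targetpath": "/home/engine/datasources/OneTouch/Eyeota/mft/US/",
    "segspath": "/home/engine/datasources/OneTouch/Eyeota/mft/US/segs/",
    "dirs": dated_dirs(date(2024, 3, 18), date(2024, 3, 23)),
    "childfolder": "HEMSHA2",
    "finalConvert": True,
    "cleanOnStart": True,
    "sample": False,
    "verbose": False,
    "writekeys": True,
    "description": "Process Eyeota Segments",
    "project": "Q1Patch1Eyeota_Pro",
}

# Serialize to the JSON body shown in the example above.
print(json.dumps(request, indent=2))
```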


Loading Primary Contact Table

ProcessSegments generates 2 outputs:

  1. Segment files - located in segspath
  2. maindatafile.* - located in targetpath

To view the additional metrics (i.e., those generated by enhanceData = true), the following should be added to the CreateTableFromFile method that creates the Primary Contact Table:

  "definition": [
    "SHA256|CONTINUOUS|STRING|68",
    "key|CONTINUOUS|INTEGER|",
    "entries|DISCRETE|INTEGER|BYTE",
    "total_segments|DISCRETE|INTEGER|BYTE",
    "unique_segments|DISCRETE|INTEGER|BYTE"
  ],
  "loading": [
    "SHA256",
    "key",
    "entries",
    "total_segments",
    "unique_segments"
  ],
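
In the fragment above, each loading entry matches the first field of the corresponding definition entry. A short Python sketch can derive one list from the other so the two cannot drift apart. That loading always names every defined field, in the same order, is an assumption drawn from this example only.

```python
# Field definitions copied from the fragment above.
definition = [
    "SHA256|CONTINUOUS|STRING|68",
    "key|CONTINUOUS|INTEGER|",
    "entries|DISCRETE|INTEGER|BYTE",
    "total_segments|DISCRETE|INTEGER|BYTE",
    "unique_segments|DISCRETE|INTEGER|BYTE",
]

# The field name is everything before the first "|" separator.
loading = [entry.split("|", 1)[0] for entry in definition]

table_fragment = {"definition": definition, "loading": loading}
print(loading)
```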


Report File Content

| Key | Value | Description |
| --- | --- | --- |
| ProcessSegment | "6.11.6.1" | Version of the segment processor used to generate the report. |
| sourcepath | "/home/engine/datasources/US/" | Location of the source data. |
| targetpath | "/home/engine/campaignRoot/[realm]/[project]/" | Root folder for storage of processed data (i.e., main hash file). Generally the same as sourcepath. |
| segspath | "/home/engine/campaignRoot/[realm]/[project]/segs" | Root folder for processed segment files. |
| childfolder | "childfolder_name" | Name of the child folder, if used. For example, "HEMSHA2". |
| dirs[] | ["dir1", "dir2", "..."] | List of source folders containing raw data that are eligible for processing. |
| writekeys | true/false | If true, segment files have been written. |
| checking | true/false | Reserved for future use. |
| sample | true/false | If true, a sample of the data was processed (the first file in each specified folder; see the sample parameter). |
| verbose | true/false | If true, logging is detailed. |
| count | true/false | Reserved for future use. |
| followonly | true/false | Reserved for future use. |
| collated | true/false | Reserved for future use. |
| files_processed | N | Number of files processed in total. |
| folders_processed | ["Dir1", "Dir2", ...] | List of folders included in processing (a subset of dirs[]). |
| folders_processed_lines | [N1, N2, ...] | Total number of lines processed in each folder (corresponds to folders_processed). |
| processedDirs[] | ["dirN"] | Name of the folder that stores the segment files (underneath segspath). |
| totalLines | 2000000000 | Total number of lines processed (usually 2 billion). |
| uniqueLines | 997217069 | Total number of unique IDs extracted from all processed folders (this is the number of records in the primary contact table). |
| maxLines | 2000000000 | |
| start | "2024-11-06 11:43:30" | Timestamp of the start of processing. |
| end | "2024-11-06 18:35:53" | Timestamp of the end of processing. |
| duration | N | Number of seconds taken to process the segments. |
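
The start, end and duration fields can be cross-checked against one another. The Python sketch below assumes the timestamp layout shown in the table ("YYYY-MM-DD HH:MM:SS") and that both timestamps are in the same time zone; elapsed_seconds is a hypothetical helper name.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # layout of the start and end values


def elapsed_seconds(report):
    """Return the whole seconds between a report's start and end timestamps."""
    start = datetime.strptime(report["start"], TIMESTAMP_FORMAT)
    end = datetime.strptime(report["end"], TIMESTAMP_FORMAT)
    return int((end - start).total_seconds())


# Timing fields taken from the documented example values.
report = {
    "start": "2024-11-06 11:43:30",
    "end": "2024-11-06 18:35:53",
    "duration": 24743,
}

computed = elapsed_seconds(report)
print(computed, computed == report["duration"])
```

For these values the interval is 6 hours, 52 minutes and 23 seconds, which agrees with a duration of 24743 seconds.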

Sample reportFile contents:

{
  "ProcessSegment": "2024.7.22.1",
  "sourcepath": "...mft/US/",
  "targetpath": ".../datasource-audience/",
  "segspath": ".../datasource-audience/segs/",
  "childfolder": "HEMSHA2",
  "dirs": [
    "20240707",
    "20240706",
    "20240705",
    "20240704",
    "20240703",
    "20240702",
    "20240701"
  ],
  "writekeys": true,
  "checking": false,
  "sample": false,
  "verbose": false,
  "count": false,
  "followonly": false,
  "collated": true,
  "files_processed": 757,
  "folders_processed": [
    "20240707/HEMSHA2",
    "20240706/HEMSHA2",
    "20240705/HEMSHA2",
    "20240704/HEMSHA2",
    "20240703/HEMSHA2",
    "20240702/HEMSHA2",
    "20240701/HEMSHA2"
  ],
  "folders_processed_lines": [
    535469843,
    836427346,
    109439092,
    598006467,
    738713871,
    46902468,
    93804936
  ],
  "processedDirs": [
    "20240707"
  ],
  "totalLines": 2000000000,
  "uniqueLines": 1179447368,
  "maxLines": 2000000000,
  "start": "2024-11-06 11:43:30",
  "end": "2024-11-06 18:35:53",
  "duration": 24743
}

A further sample, from a run with 15 entries in dirs:

{
  "sourcepath": "/home/engine/datasources/OneTouch/Eyeota/mft/US/",
  "targetpath": "/home/engine/datasources/campaignRoot/onetouch-dev01/eyeota-audience/",
  "segspath": "/home/engine/datasources/campaignRoot/onetouch-dev01/eyeota-audience/segs/",
  "childfolder": "HEMSHA2",
  "dirs": [
    "20240508",
    "20240507",
    "20240506",
    "20240505",
    "20240504",
    "20240503",
    "20240502",
    "20240501",
    "20240430",
    "20240429",
    "20240428",
    "20240427",
    "20240426",
    "20240425",
    "20240424"
  ],
  "writekeys": true,
  "checking": false,
  "sample": false,
  "verbose": false,
  "count": false,
  "followonly": false,
  "processedDirs": [
    "20240508",
    "20240507",
    "20240506",
    "20240505",
    "20240504",
    "20240503",
    "20240502"
  ]
}
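
For reports that include folders_processed and folders_processed_lines, such as the first sample above, the two lists can be paired up and compared with dirs. The Python sketch below is an illustration only: summarize_folders is a hypothetical helper, and it assumes that a child folder, when used, appears as a "/childfolder" suffix on each folders_processed entry, as it does in that sample.

```python
def summarize_folders(report):
    """Pair each processed folder with its line count and list unprocessed dirs."""
    lines_by_folder = dict(
        zip(report["folders_processed"], report["folders_processed_lines"])
    )
    # Strip any "/childfolder" suffix to recover the top-level folder name.
    processed_roots = {name.split("/", 1)[0] for name in report["folders_processed"]}
    skipped = [name for name in report["dirs"] if name not in processed_roots]
    return lines_by_folder, skipped


# Values copied from the first sample report.
report = {
    "dirs": ["20240707", "20240706", "20240705", "20240704",
             "20240703", "20240702", "20240701"],
    "folders_processed": [
        "20240707/HEMSHA2", "20240706/HEMSHA2", "20240705/HEMSHA2",
        "20240704/HEMSHA2", "20240703/HEMSHA2", "20240702/HEMSHA2",
        "20240701/HEMSHA2",
    ],
    "folders_processed_lines": [
        535469843, 836427346, 109439092, 598006467,
        738713871, 46902468, 93804936,
    ],
}

lines_by_folder, skipped = summarize_folders(report)
for folder, line_count in lines_by_folder.items():
    print(f"{folder}: {line_count:,} lines")
print("not processed:", skipped)
```

In this sample every entry in dirs was processed, so the list of unprocessed folders is empty.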